Given a probabilistic model and a particular observation, Bayesian inference is straightforward to implement. Inferences, and any decisions based upon them, follow immediately in the form of expectations with respect to the corresponding posterior distribution. Building a probabilistic model that is useful in a given application, however, is a far more open-ended challenge. Unfortunately the process of model building is often ignored in introductory texts, leaving practitioners to piece together their own model building workflow from potentially incomplete or even inconsistent heuristics.
In this case study I introduce a principled workflow for building and evaluating probabilistic models in Bayesian inference. This workflow is based on recent research in collaboration with Dan Simpson, Aki Vehtari, Sean Talts, and others; see, for example, Gabry et al. (2017). The specific workflow that I advocate here, however, is my particular take on this research that is focused on the needs of modeling and inference in the physical sciences. It does not necessarily reflect the perspectives of any of my collaborators and may prove to be limiting in other applications.
Moreover, as this workflow is an active topic of research it is subject to evolution and refinement. Use it at your own risk. But at the same time don't not use it at your own risk. Better yet build a calibrated model, infer the relative risks, and then make a principled decision…
We begin with a review of Bayesian inference and then a discussion of what properties we want in a probabilistic model and in what ways we can validate those properties. I then introduce a workflow that implements these validations to guide the development of a useful model, emphasizing the role of each step and the expertise needed to implement them well. Once the workflow has been laid out I demonstrate its application on a seemingly simple analysis that hides a few surprises.
In order to avoid any confusion due to the varied notations and vocabularies that abound in the statistical literature, let’s first review the foundational concepts in Bayesian inference. Be aware that I’ll be taking an experimental perspective here, and that perspective may clash with different philosophical perspectives on Bayesian inference, or statistics in general, that are often emphasized in introductory texts. Fortunately, in most cases the differences between these perspectives don’t have any consequences for how the resulting inferences are implemented in practice.
Let’s first consider some phenomenon about which we want to learn. Further let’s assume that this phenomenon manifests in some latent system with which we can interact. Importantly, this latent system contains not only the phenomenon of interest but also the surrounding environment that exhibits the phenomenon.
In order to learn about this phenomenon we need to design an experiment that implements a measurement process. The resulting measurements probe the latent system, generating data that is hopefully sensitive to the phenomenon of interest. Note that I use the term experiment generally here to include also processes that collect, curate, and transform existing observations. More formally, we say that each realization of the measurement process is an observation, \(\tilde{y}\), that is an element of an observation space, \(Y\).
The measurement process, however, is not deterministic. There may be ontological variability that results in different observations every time we probe the latent system, or epistemic uncertainty that limits how accurately we can resolve the measurement. Regardless of its interpretation, we presume that this complication is sufficiently regular that it can be quantified in a probability distribution over the observation space that we denote the true data generating process, \(\pi^{\dagger}\). Because of this limitation of the measurement process, the best we can ultimately learn about the underlying latent system, and hence the phenomenon of interest, is this true data generating process.
Although we presume that a true data generating process exists, we don’t know what form it will take in a given application. Unfortunately the space of all possible data generating processes, \(\mathcal{P}\), is far too large and mathematically ungainly to consider, so in practice we have to limit our search to a small world, \(\mathcal{S}\), or observational model of possible data generating processes.
Each element of the observational model, or each model configuration, defines a probability distribution over the observation space which quantifies a mathematical narrative of how the data could be generated, encapsulating the phenomenon of interest as well as any systematic phenomena encountered from the latent system through the measurement process. We can then use the assumed observational model to learn from observations, making inferences and informing decisions about the latent system that we’re studying. In practice we specify an observational model with a collection of probability densities, \(\pi_{\mathcal{S}} (y ; \theta)\), with each parameter, \(\theta\), identifying a unique model configuration.
Bayesian inference encodes information about the observational model in probability distributions – the more probability a distribution assigns to a neighborhood of model configurations, the more consistent those model configurations are with our available information. From this perspective the observational model is no longer just a collection of data generating processes but also a conditional probability distribution over the product of the observation space and the small world, \(Y \times \mathcal{S}\), \[ \pi_{\mathcal{S}}(y ; \theta) = \pi_{\mathcal{S}}(y \mid \theta). \]
Introducing a prior distribution, \(\pi_{\mathcal{S}}(\theta)\), that encodes domain expertise about the measurement process and latent system complements the observational model to give a Bayesian joint distribution, \[ \pi_{\mathcal{S}}(y, \theta) = \pi_{\mathcal{S}}(y ; \theta) \, \pi_{\mathcal{S}}(\theta). \] Given an explicit observation, \(\tilde{y}\), this joint distribution then defines a posterior distribution that concentrates on those model configurations consistent with both our domain expertise and the measurement, \[ \pi_{\mathcal{S}}(\theta \mid \tilde{y}) = \frac{ \pi_{\mathcal{S}}(\tilde{y}, \theta) } { \int \mathrm{d} \theta \, \pi_{\mathcal{S}}(\tilde{y}, \theta) } \propto \pi_{\mathcal{S}}(\tilde{y}, \theta). \]
We can also interpret this process as updating the prior distribution into a posterior distribution, \[ \pi_{\mathcal{S}}(\theta \mid \tilde{y}) = \left( \frac{ \pi_{\mathcal{S}}(\tilde{y} \mid \theta) } { \pi_{\mathcal{S}}(\tilde{y}) } \right) \pi_{\mathcal{S}}(\theta), \] which makes the inherent learning process more evident. Our information after the observation, encoded in the posterior, is just our information before the observation, encoded in the prior, plus the information we learn from the observation, encoded in the ratio \[ \frac{ \pi_{\mathcal{S}}(\tilde{y} \mid \theta) } { \pi_{\mathcal{S}}(\tilde{y}) }. \]
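To make this updating concrete, here is a minimal sketch using a conjugate beta-binomial model; the model and numbers are illustrative choices of mine, not from the text. With a Beta prior and binomial data, the posterior is available in closed form, so the prior-to-posterior update is a single arithmetic step:

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Conjugate update: a Beta(alpha, beta) prior combined with binomial
    data yields a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + (trials - successes)

def beta_mean(alpha, beta):
    # Mean of a Beta(alpha, beta) distribution.
    return alpha / (alpha + beta)

# A weakly-informative prior centered at 0.5.
a0, b0 = 2.0, 2.0
# Observe 7 successes in 10 trials.
a1, b1 = beta_binomial_update(a0, b0, 7, 10)

# The posterior mean sits between the prior mean (0.5) and the
# empirical frequency (0.7), pulled toward the data.
print(beta_mean(a0, b0), beta_mean(a1, b1))
```

The posterior concentrates on parameter values consistent with both the prior and the observation, exactly as the update formula above describes.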
Any inferential query we have can then be answered with an expectation value of an appropriate function with respect to the posterior distribution. For example, where in the small world the most consistent model configurations concentrate can be quantified with the mean or the median of the posterior distribution. Similarly, the breadth of this concentration can be quantified with the standard deviation or tail quantiles of the posterior distribution.
Given an observational model and a prior distribution, or equivalently the joint distribution, \(\pi_{\mathcal{S}}(y, \theta)\), Bayesian inference is straightforward to implement, at least conceptually. The utility of any inferences we derive, however, is completely dependent on this assumed model.
A poor model might readily yield inferences, but those inferences may be of little use in practice and may even be dangerous in critical applications.
What, however, characterizes a useful probabilistic model? Ideally the model would be consistent with our domain expertise, capturing not necessarily all of our knowledge of the experimental design but just enough to admit answers to the scientific questions of interest. It would also help if the model didn't obstruct our available computational tools so that we could accurately compute posterior expectation values and faithfully implement our inferences in practice. Finally, we would hope that our model is rich enough to capture the structure of the true data generating process that is needed to answer our scientific questions.
Consequently we can verify the utility of a model in a given application by trying to answer four questions.
1. Is our model consistent with our domain expertise?
2. Are our computational tools sufficient to accurately fit the model?
3. How do we expect our inferences to perform over the possible realizations of the measurement process?
4. Is our model rich enough to capture the relevant structure of the true data generating process?
These questions can be challenging to answer even in simple applications, but with careful analysis of a given model we can assemble the necessary evidence to answer each of the four questions well enough to identify practical problems with the model.
A common refrain is to “let the data speak for itself”. In a statistical analysis, however, the data speak only through the observational model, and the observational model isn’t always so talkative.
When the likelihood function, \(\pi_{\mathcal{S}}(\tilde{y} \mid \theta)\), concentrates in a small neighborhood of parameter space the structure of the prior isn’t particularly important. The information we learn from the observation dominates the form of the posterior distribution.
The observational model, however, isn’t always so well-behaved. Sometimes the experimental design isn’t sufficiently sensitive to the phenomena of interest, which manifests in a weakly-identified likelihood function that spans huge regions of the model configuration space, or even a non-identified likelihood function that extends all the way to infinity. In these cases the form of the posterior is strongly influenced by the form of the prior, and a careless prior propagates to a careless posterior.
In order to compensate for poor identification in the likelihood function we need the prior distribution to incorporate just enough domain expertise to suppress extreme, although not impossible, model configurations.
That said, the success of a weakly-informative prior like this depends on how the prior interacts with the likelihood function, and in sophisticated models these interactions can be difficult to analyze. In practice it’s often easiest to study how these interactions affect observations predicted by the model.
The prior predictive distribution, \[ \pi_{\mathcal{S}}(y) = \int \mathrm{d} \theta \, \pi_{\mathcal{S}}(y, \theta), \] averages each data generating process within the observational model over the prior distribution, quantifying the range of observations deemed reasonable within the scope of our modeling assumptions. A model predicting too many observations that are considered extreme within the scope of our domain expertise then indicates conflict between the modeling assumptions and that domain expertise.
Analyzing the prior predictive distribution, however, is not particularly straightforward when the observation space is high-dimensional. In this case we have to consider summary statistics that project the observation space to some lower-dimensional space that's more amenable to visualization, and hence practical comparison to our often implicit domain expertise. For example we might project the observations to a one-dimensional summary, \(t : Y \rightarrow \mathbb{R}\), or even a structured summary like a histogram or empirical cumulative distribution function. Constructing an interpretable yet informative summary statistic is very much a fine art.
Once a summary statistic has been chosen we can then identify which values of the summary statistic correspond to observations that begin to become extreme. For example, if we're observing how quickly a drug propagates through a patient's bloodstream we can be fairly confident that the observed propagation speed will be less than the speed of sound in water, as a drug dose traveling that fast would have unfortunate physical consequences. Similarly, if we're observing carbon in the air then we don't want our model predicting carbon concentrations more characteristic of solid concrete than gaseous, breathable air. If we're observing the migration of birds then we can be fairly sure that we won't observe any particularly healthy individuals cruising near the speed of light. In practice these thresholds are readily identified with a cursory examination of the available domain expertise, especially if the summary statistics are well-chosen.
With summary statistics and thresholds in hand, prior predictive checks follow immediately. We simply push the prior predictive distribution forward along each summary statistic, yielding the probability density function \(\pi_{t(Y)}(t)\) shown here in dark red, and then consider how much probability leaks past the threshold and into the extreme values, shown here in light red.
In practice we typically can’t construct the probability density function for this pushforward distribution analytically, but we can approximate it using samples from the joint distribution. Sampling \[ \tilde{\theta} \sim \pi_{\mathcal{S}}(\theta) \\ \tilde{y} \sim \pi_{\mathcal{S}}(y \mid \tilde{\theta}) \] yields samples from the joint distribution, \((\tilde{y}, \tilde{\theta})\), and evaluating the summary statistic at the simulated observations, \[ t(\tilde{y}), \] yields samples from the pushforward of the prior predictive distribution. We can then histogram these samples to implement the prior predictive check.
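This sampling procedure can be sketched as follows, assuming a hypothetical observational model chosen purely for illustration: a half-normal prior over a Poisson intensity, ten counts per observation, and the largest count as the summary statistic.

```python
import math
import random

random.seed(8675309)

def sample_prior():
    # Hypothetical prior on a Poisson intensity: half-normal with scale 5.
    return abs(random.gauss(0.0, 5.0))

def sample_poisson(lam):
    # Poisson sampling by inversion of the CDF (adequate for moderate lam).
    u, k = random.random(), 0
    p = math.exp(-lam)
    c = p
    while u > c:
        k += 1
        p *= lam / k
        c += p
    return k

def sample_observation(theta, n=10):
    # Hypothetical observational model: n conditionally independent counts.
    return [sample_poisson(theta) for _ in range(n)]

def t(y):
    # Summary statistic: the largest count in an observation.
    return max(y)

# Push the prior predictive distribution forward along t.
prior_pred_t = []
for _ in range(1000):
    theta_sim = sample_prior()
    y_sim = sample_observation(theta_sim)
    prior_pred_t.append(t(y_sim))

# How much probability leaks past a domain-expertise threshold, here 25?
leak = sum(s > 25 for s in prior_pred_t) / len(prior_pred_t)
print(leak)
```

Histogramming `prior_pred_t` against the threshold gives exactly the visual check described above; a large `leak` would signal conflict between the prior and our domain expertise.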
This is not to say that the model should predict no observations past the thresholds – extreme observations are unreasonable but not impossible, after all. Instead what we want to look for is excessive predictions surpassing the thresholds, and this is a more qualitative than quantitative judgement. Indeed visualizations of the pushforward distributions are a powerful way to elicit qualitative information from domain experts unfamiliar with statistics. The emphasis on the observation space is often more interpretable than the configurations of a specific observational model. Moreover, the emphasis on extremes allows those experts to reject bad assumptions, which they are often far more eager to do than to accept good assumptions.
The idea of analyzing the prior predictive distribution goes back to Good’s device of imaginary results (Good 1950), which conveniently also gives its user +2 to Wisdom saving throws.
A model is only as good as our ability to wield it. In Bayesian inference, a model is only as good as our ability to compute expectation values with respect to the posterior distributions it generates, which is often denoted fitting the model. Robust methods that offer self-diagnostics capable of identifying inaccurate computation, such as Hamiltonian Monte Carlo, are extremely powerful in this regard as they provide some confidence that we can realize the inferences of our model.
If these diagnostics are available then the prior predictive distribution proves further useful. Fitting observations simulated from the prior predictive distribution allows us to check the performance of our chosen computational method over the range of reasonable posterior distribution behaviors.
What can we do, however, if our method isn’t self-diagnostic? Fortunately Bayesian inference implies a subtle consistency property that allows us to check any method capable of generating posterior samples. Averaging the posterior distributions fit from observations drawn from the prior predictive distribution will always recover the prior distribution, \[ \pi_{\mathcal{S}}(\theta') = \int \mathrm{d} y \, \mathrm{d} \theta \, \pi_{\mathcal{S}} (\theta' \mid y) \, \pi_{\mathcal{S}} (y, \theta). \] In particular this implies that the ensemble of posterior samples, \[ \tilde{\theta} \sim \pi_{\mathcal{S}}(\theta) \\ \tilde{y} \sim \pi_{\mathcal{S}}(y \mid \tilde{\theta}) \\ \tilde{\theta}' \sim \pi_{\mathcal{S}}(\theta \mid \tilde{y}), \] will be equivalent to samples from the prior distribution. Any deviation between the two samples indicates that either our model has been incorrectly implemented or our computational method is generating biased samples.
Simulation-based calibration (Talts et al. 2018) compares the ensemble posterior sample and the prior sample using ranks. For each simulated observation we generate \(R\) samples from the corresponding posterior distribution, \[ \tilde{\theta} \sim \pi_{\mathcal{S}}(\theta) \\ \tilde{y} \sim \pi_{\mathcal{S}}(y \mid \tilde{\theta}) \\ (\tilde{\theta}'_{1}, \ldots, \tilde{\theta}'_{R}) \sim \pi_{\mathcal{S}}(\theta \mid \tilde{y}), \] and compute the rank of the prior sample within the posterior samples, i.e. the number of posterior samples larger than the prior sample, \[ \rho = \sharp \left\{ \tilde{\theta} < \tilde{\theta}'_{r} \right\}. \] If the ensemble posterior sample and the prior sample are equivalent then these ranks should be uniformly distributed, which we can quickly analyze for each parameter in the model configuration space using a histogram.
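A sketch of this rank computation, using a hypothetical conjugate normal model where exact posterior samples are available, so the ranks should indeed come out uniform (all modeling choices here are illustrative, not from the text):

```python
import math
import random

random.seed(1234)

R = 99    # posterior samples per simulated observation
N = 1000  # number of simulations

# Hypothetical conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, 1),
# so the exact posterior is theta | y ~ N(y / 2, 1 / 2).
ranks = []
for _ in range(N):
    theta_sim = random.gauss(0.0, 1.0)
    y_sim = random.gauss(theta_sim, 1.0)
    post = [random.gauss(y_sim / 2.0, math.sqrt(0.5)) for _ in range(R)]
    # Rank: the number of posterior samples larger than the prior draw.
    ranks.append(sum(theta_sim < p for p in post))

# With exact posterior samples the ranks are uniform on {0, ..., R};
# a histogram of `ranks` should look flat up to sampling variation.
counts = [ranks.count(r) for r in range(R + 1)]
print(sum(counts), min(counts), max(counts))
```

A biased sampler or a mis-implemented model would instead show up as systematic structure in the histogram of `ranks`, which is exactly what the grey uniformity bands described below are designed to expose.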
Here the grey band demonstrates the expected variation of ranks distributed according to an exact uniform distribution. Deviation outside of these bands, especially systematic deviation of many bins at once, indicates computational problems. See Talts et al. (2018) for further discussion of common systematic deviations and their interpretations.
Technically this method requires exact posterior samples, which are not available if we are using, for example, Markov chain Monte Carlo. To use simulation-based calibration with Markov chain Monte Carlo we have to first thin our Markov chains to remove any effective autocorrelation. See Talts et al. (2018) for more details.
The scope of reasonable observations and model configurations quantified in the Bayesian joint distribution also provides a way to analyze the range of inferential outcomes we should expect, and in particular calibrate them with respect to the simulated truths (Betancourt 2018).
For example, let’s say that we have some decision making process that takes in an observation as input and returns an action from the space of possible actions, \(A\), \[ \begin{alignat*}{6} a :\; &Y& &\rightarrow& \; &A& \\ &y& &\mapsto& &a(y)&. \end{alignat*} \] This process might be a Bayesian decision theoretical process informed by posterior expectations, but it need not be. It could just as easily be a process to reject or accept a null hypothesis using an orthodox significance test.
Further let’s say that the benefit of taking action \(a \in A\) when the true data generating process is given by the model configuration \(\theta\) is \[
\begin{alignat*}{6}
U :\; &A \times \mathcal{S}& &\rightarrow& \; &\mathbb{R}&
\\
&(a, \theta)& &\mapsto& &U(a, \theta)&.
\end{alignat*}
\] For example this might be a false discovery rate or the benefits of an intervention minus any of its implementation costs. In this context the utility of a decision making process for a given observation becomes
\[
\begin{alignat*}{6}
U \circ a :\; &Y \times \mathcal{S}& &\rightarrow& \; &\mathbb{R}&
\\
&(y, \theta)& &\mapsto& &U(a(y), \theta)&.
\end{alignat*}
\]
Before the measurement process resolves we don’t know what the observation will be, nor do we know what the true data generating process will be. If we assume that our model is rich enough to capture the true data generating process, however, then the joint distribution quantifies the possibilities. Consequently we can construct the distribution of possible utilities by pushing the Bayesian joint distribution forward along the utility function to give \(\pi_{U(A, \mathcal{S})}\).
As with the prior predictive check, we can approximate the probability density function of this pushforward distribution in practice using samples from the joint distribution, \[ \tilde{\theta} \sim \pi_{\mathcal{S}}(\theta) \\ \tilde{y} \sim \pi_{\mathcal{S}}(y \mid \tilde{\theta}) \\ U(a(\tilde{y}), \tilde{\theta}). \]
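A minimal sketch of this utility pushforward, assuming a hypothetical conjugate normal model, a decision process that reports the exact posterior mean, and a negative squared error utility (all illustrative choices, not from the text):

```python
import random

random.seed(42)

# Hypothetical model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
def a(y):
    # Decision process: report the exact conjugate posterior mean, y / 2.
    return y / 2.0

def U(action, theta):
    # Utility: negative squared error between the action and the truth.
    return -(action - theta) ** 2

# Push the Bayesian joint distribution forward along the utility.
utilities = []
for _ in range(5000):
    theta_sim = random.gauss(0.0, 1.0)
    y_sim = random.gauss(theta_sim, 1.0)
    utilities.append(U(a(y_sim), theta_sim))

# For this model the expected utility works out analytically to -1/2,
# which the simulation estimate should approach.
mean_utility = sum(utilities) / len(utilities)
print(mean_utility)
```

Histogramming `utilities` gives the full distribution of utilities described above, not just its mean, so we can also inspect its tails for rare but costly outcomes.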
This distribution of utilities is particularly useful for understanding how sensitive the measurement process is to the relevant features of the latent system that we're studying. By carefully encoding what we want to learn from the experiment into a utility function, we can formalize how capable our model is of answering those questions.
We can also take a less application-specific approach and use the Bayesian joint distribution to characterize the general performance of our inferences and capture any pathologies that might be lurking within the assumptions of our model.
Consider, for example, the posterior \(z\)-score of a given parameter in the model configuration space, \[ z = \left| \frac{ \mu_{\mathrm{post}} - \tilde{\theta} } { \sigma_{\mathrm{post}} } \right|. \] The posterior \(z\)-score quantifies how accurately the posterior recovers the true model configuration, with smaller values indicating a posterior that concentrates more tightly around the corresponding true parameter and larger values indicating a posterior that concentrates elsewhere.
Similarly, the posterior shrinkage of a given parameter, \[ s = 1 - \frac{ \sigma^{2}_{\mathrm{post}} } { \sigma^{2}_{\mathrm{prior}} }, \] quantifies how much the posterior learns about that parameter from a given observation. Posterior shrinkage near zero indicates that the data provide little information beyond the domain expertise encoded in the prior distribution, while posterior shrinkage near one indicates observations that strongly inform this parameter.
Conveniently, these two posterior expectations can capture an array of pathological behavior that typically compromises our inferences. For example, a posterior with high posterior shrinkage and high posterior \(z\)-score indicates a model that overfits to the variation in the measurement and is unable to accurately recover the true model configuration. A posterior with small posterior shrinkage and small posterior \(z\)-scores indicates a model that is poorly informed by the observations, whereas a posterior with small posterior shrinkage and large posterior \(z\)-scores indicates a model with substantial conflict between the observational model and the prior distribution.
If we fit the posterior for each observation in an ensemble of samples from the Bayesian joint distribution, \[ \tilde{\theta} \sim \pi_{\mathcal{S}}(\theta) \\ \tilde{y} \sim \pi_{\mathcal{S}}(y \mid \tilde{\theta}), \] then we can visualize the distribution of posterior \(z\)-scores and posterior shrinkages for each parameter, \[ z(\pi_{\mathcal{S}}(\theta_{n} \mid \tilde{y}), \tilde{\theta}_{n}) \\ s(\pi_{\mathcal{S}}(\theta_{n} \mid \tilde{y}), \pi_{\mathcal{S}}(\theta_{n})) \] with a scatter plot of each posterior’s behavior and quickly identify potential problems.
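These computations can be sketched as follows for a hypothetical conjugate normal model where the posterior moments are available in closed form; this is an illustrative choice, and in practice the moments would instead be estimated from posterior samples.

```python
import math
import random

random.seed(2718)

# Hypothetical conjugate model: theta ~ N(0, sigma_prior^2) and
# y_i | theta ~ N(theta, 1) for i = 1, ..., n, with exact posterior
# variance 1 / (n + 1 / sigma_prior^2) and mean var_post * sum(y).
sigma_prior = 3.0
n = 20

z_scores, shrinkages = [], []
for _ in range(500):
    theta_sim = random.gauss(0.0, sigma_prior)
    ys = [random.gauss(theta_sim, 1.0) for _ in range(n)]
    var_post = 1.0 / (n + 1.0 / sigma_prior ** 2)
    mu_post = var_post * sum(ys)
    # Posterior z-score: distance of the simulated truth from the
    # posterior mean, in units of posterior standard deviations.
    z_scores.append(abs((mu_post - theta_sim) / math.sqrt(var_post)))
    # Posterior shrinkage: reduction of variance relative to the prior.
    shrinkages.append(1.0 - var_post / sigma_prior ** 2)

# Strong shrinkage with modest z-scores indicates an informative
# measurement whose posteriors accurately recover the simulated truths.
print(sum(shrinkages) / len(shrinkages), sum(z_scores) / len(z_scores))
```

Plotting `z_scores` against `shrinkages` gives exactly the scatter plot described above; points drifting toward high z-score or low shrinkage would flag the pathologies discussed in the previous paragraph.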
Finally we can consider the adequacy of our model in capturing the relevant structure of the true data generating process.
If our model is sufficiently rich then the small world might contain the true data generating process and inferences within the small world might identify that true data generating process. In practice, however, reality is much richer than any model we could feasibly implement and the small world is unlikely to contain the true data generating process.
In this more realistic scenario the posteriors can’t concentrate around the true data generating process, and must instead settle for concentrating around the model configurations within the small world that best approximate the true data generating process.
The pertinent question is then not whether or not our small world contains the true data generating process but rather whether or not our small world is close enough to the true data generating process. To paraphrase George Box, our model is probably wrong, but is it useful?
Consequently, in order to quantify the adequacy of our model we need to determine how close it is to the true data generating process, and in order to do that we need to summarize our inferences into a predictive distribution that we can directly compare to the true data generating process. For that we appeal to the posterior predictive distribution, \[ \pi_{\mathcal{S}}(y \mid \tilde{y}) = \int \mathrm{d} \theta \, \pi_{\mathcal{S}}(\theta \mid \tilde{y}) \, \pi_{\mathcal{S}}(y \mid \theta), \] which averages all of the data generating processes within the observational model over the posterior distribution, quantifying the range of predictions consistent with both our domain expertise and the observation, \(\tilde{y}\).
If we want to compare the whole of the posterior predictive distribution and the true data generating process, then some reasonable assumptions quickly lead us to the Kullback-Leibler divergence, \(\mathrm{KL}(\pi^{\dagger}(y) \mid \mid \pi_{\mathcal{S}}(y \mid \tilde{y}))\) (Betancourt 2015), which vanishes when the two distributions are identical and monotonically increases as they deviate.
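To build some intuition for this divergence, here is a toy computation over discrete distributions where both sides are known; in practice, of course, we never know the true distribution, which is exactly the difficulty raised next.

```python
import math

def kl_divergence(p, q):
    # KL(p || q) for discrete distributions given as aligned probability
    # lists; terms with p_i = 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

truth = [0.5, 0.3, 0.2]       # stand-in for the true data generating process
good_model = [0.45, 0.35, 0.2]
poor_model = [0.1, 0.1, 0.8]

print(kl_divergence(truth, truth))       # vanishes for identical distributions
print(kl_divergence(truth, good_model))  # small deviation, small divergence
print(kl_divergence(truth, poor_model))  # large deviation, large divergence
```

The divergence vanishes only when the two distributions match and grows as they separate, which is the monotonicity property the text appeals to.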
That said, in practice we can’t compute this divergence without knowing the true data generating process! Instead we have to approximate the divergence using just our observed data, which ultimately yields Bayesian cross validation and the many information criteria that further approximate it. Importantly, in this series of approximations we lose the ability to estimate absolute divergences and instead can estimate only differences between divergences from the posterior predictive distributions of different models. This makes these approximations useful for model comparison, but not for determining the adequacy of a single model in isolation.
Moreover, there’s also a deeper issue with comprehensive measures like the Kullback-Leibler divergence between the posterior predictive distribution and the true data generating process. We know our model will not capture every intimate detail of reality, and we really only want our model to capture those details relevant to our scientific goals. If we’re modeling bird migration then we don’t have any need to model the irrelevant effects of quantum mechanics or general relativity! Comprehensive measures, however, are often strongly affected by these irrelevant details, making it difficult to isolate the structures about which we do care.
Fortunately we’ve already thought about isolating the features of the observation space relevant to our scientific questions – they’re exactly what motivated the summary statistics we would use for a prior predictive check! Consequently we can reuse those summary statistics to construct posterior predictive checks that visually compare the pushforwards of the posterior predictive distribution, \(\pi_{t(Y) \mid Y}(t \mid \tilde{y})\), to the observed summary statistic, \(t(\tilde{y})\).
If the observed summary statistic, here shown in light red, falls within the bulk of the pushforward distribution, here shown in dark red, then our model is doing an adequate job of modeling the behaviors captured by the summary statistic.
If the observed summary statistic falls in the tails of the pushforward distribution, however, then the situation is not so clear.
Unfortunately we can’t discriminate between our observation being a rare but not impossible extreme and our model being insufficiently sophisticated to capture the relevant structure of the true data generating process that manifests in the summary statistic. Instead we have to rely on our domain expertise to identify possible model expansions that could resolve this discrepancy without compromising our inferences if the discrepancy is just a mirage.
As with prior predictive checks, in practice we typically can’t construct the probability density functions for these pushforward distributions analytically in order to implement a posterior predictive check. We can, however, readily approximate them with samples from the posterior predictive distribution, \[ \tilde{y}' \sim \pi_{\mathcal{S}}(y \mid \tilde{y}), \] on which we can then evaluate our summary statistics, \[ t(\tilde{y}'), \] to give samples from the pushforwards of the posterior predictive distribution.
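A sketch of this procedure for a hypothetical conjugate normal model with illustrative data; in practice the posterior draws would come from our computational method of choice rather than a closed form.

```python
import random

random.seed(101)

# Hypothetical conjugate model: theta ~ N(0, 9), y_i | theta ~ N(theta, 1),
# so given the observation y_tilde the posterior over theta is available
# in closed form.
y_tilde = [1.8, 2.2, 1.5, 2.9, 2.1]  # illustrative observed data
n = len(y_tilde)
var_post = 1.0 / (n + 1.0 / 9.0)
mu_post = var_post * sum(y_tilde)

def t(y):
    # Summary statistic: the sample maximum.
    return max(y)

# Pushforward of the posterior predictive distribution along t.
post_pred_t = []
for _ in range(2000):
    theta_draw = random.gauss(mu_post, var_post ** 0.5)
    y_rep = [random.gauss(theta_draw, 1.0) for _ in range(n)]
    post_pred_t.append(t(y_rep))

# Where does the observed summary fall relative to the pushforward?
# A value well away from 0 and 1 suggests the model adequately captures
# the behavior isolated by this summary statistic.
tail = sum(s >= t(y_tilde) for s in post_pred_t) / len(post_pred_t)
print(tail)
```

Histogramming `post_pred_t` with `t(y_tilde)` overlaid reproduces the visual comparison described above.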
Posterior predictive checks date back to Rubin (1984) and the beginnings of the simulation revolution in statistics.
If there are no indications that a given model is failing any of these four questions then we can proceed to apply it in practice confident that it will provide reasonably robust inferences for the relevant scientific goals. If there are signs of trouble for any of these four questions, however, then we have to consider improving the model, or occasionally even reducing the ambition of our scientific questions in the first place.
The goal of a principled Bayesian workflow is to guide a practitioner through a regimented model critique that addresses these four questions. In particular any criticisms need to be interpretable in order to inform the practitioner how to improve the analysis and address those criticisms. This feedback then further guides iterative model development that converges to a model capable of tackling the relevant scientific goals.
Robust model development is initialized with a conceptual analysis of the model that a practitioner would fit without the practical limitations of finite computational resources, time, and mathematical tools. This aspirational model, \(\mathcal{S}_{A}\), would incorporate all of the systematic effects that might influence the measurement process, such as heterogeneity across individuals or variation across space and time, and richer structure than typically presumed in the phenomenological models often used in practice. It would also incorporate all of the domain expertise available within a field. Reasoning about the aspirational model forces practitioners to reason about the subtleties in the latent system being studied and the measurement process being used to probe that system.
Within this abstract aspirational model we then explicitly build our initial model, \(\mathcal{S}_{1}\), to be just sophisticated enough to incorporate the phenomena of scientific interest but no more. This initial model typically includes few if any systematic effects and only as much domain expertise as needed to motivate an initial prior distribution.
The aspirational model provides critical context for this initial model. In particular, the difference between the two models, \(\mathcal{S}_{A} \setminus \mathcal{S}_{1}\), encapsulates all of the potentially relevant structure ignored by the initial model.
This abstract difference then motivates principled summary statistics that isolate these ignored structures, providing the foundations of constructive prior and posterior predictive checks for the initial model.
If the initial model proves inadequate then we have to improve it, and the context of the aspirational model guides that improvement so that we’re not adding arbitrary complexity irrelevant to our scientific goals. In particular, posterior predictive checks based on interpretable summary statistics immediately suggest the structure isolated by those summary statistics, such as heterogeneity or time dependence in the observational model or domain expertise neglected by the prior distribution. Adding this structure yields an expanded model, and if that expanded model still proves inadequate then we iterate, sequentially expanding our model until it becomes capable of achieving our scientific goals.
By reasoning about the aspirational model first we don’t randomly walk through the space of all possible models but rather take coherent steps from our simple initial model towards the aspirational model.
The emphasis on model expansion is critical. By subsuming the previous model a new model can always fall back to the previous model configurations if the added structure isn’t actually helpful in explaining the true data generating process. Moreover, quantifying our inferences with an entire posterior distribution, as opposed to say a point estimate of a single model configuration, ensures that the initial model configurations continue to influence our inferences so long as they remain consistent with the observations. This makes our model development workflow particularly robust to the faux pas of rushing towards unneeded complexity and overfitting, especially when equipping each expanded model with prior distributions that incorporate the qualitative features of penalized complexity priors that favor the previous model (Simpson et al. 2017).
Finally it’s important to recognize that a better model is not always sufficient to obtain the answers we want. Sometimes our initial scientific goals are too ambitious, or our experiment is too insensitive to the phenomena of interest. In these cases a useful workflow will also identify the limitations of the measurement process and suggest more realistic goals for the available data.
Putting this all together we can now design a principled Bayesian workflow that guides the development of robust models in practical applications. The workflow introduced here begins by examining the experimental design and the resulting measurement process. Only once the measurement process is understood do we build our initial model and study its performance in the context of our statistical and domain expertise. Finally we can analyze the fit of the model on the observed data, verifying the computational accuracy of the fit and the adequacy of the model to capture the relevant features of the true data generating process. At any point we might identify failures which motivate improved models for which the workflow restarts.
Here domain expertise refers to the experience of those responsible for collecting, curating, or otherwise manipulating the data, as well as stakeholders who will make decisions using the final inferences or any intermediaries who will help stakeholders to make those decisions. Statistical expertise refers to proficiency in probabilistic modeling and computation.
Before modeling anything we need to first consider the design of our experiment, especially in the context of our scientific goals.
We begin by developing a conceptual understanding of the measurement process from the latent phenomena of interest through its interactions with the environment and how those interactions manifest as observations. This interaction then needs to be contrasted with our scientific goals in order to qualify how sensitive the observations are to the desired scientific phenomena. Moreover, we want to pay particular attention to the many possible systematic effects that influence this process but are not known with complete certainty.
This conceptual analysis yields informal narratives describing how observations are generated, forming the basis of the aspirational model which all of our explicit models will yearn to be when they grow up.
Requirements: Domain Expertise
Given our conceptual understanding of the measurement process we can take our first steps towards building a formal mathematical model by defining the space in which observations take values. This may include, for example, the number of repetitions, subjects, groups, or components in each observation as well as the range of values each component can take.
Requirements: Domain Expertise and Statistical Expertise
With an observation space formally defined we can begin to construct summary statistics that isolate relevant properties of the experimental design, especially those properties that influence decision making and those that we expect to be difficult to model and hence unlikely to be adequately described by our early models. These summary statistics then provide the basis for answering Question One and Question Four.
In preparation for prior predictive checks we also want to complement these summary statistics with conceptual expectations for what behaviors are reasonable, or, as is often easier to elicit in practice, what behaviors are unreasonable.
Requirements: Domain Expertise
Once the experimental design has been fully considered we can begin the modeling process in earnest. Before we utilize any observed data we want to ensure that our model holds steadfast under the scrutiny of Question One, Question Two, and Question Three.
A Bayesian model requires an observational model, \(\pi_{\mathcal{S}}(y \mid \theta)\), comprised of possible data generating processes and a prior model, \(\pi_{\mathcal{S}}(\theta)\), that encodes selected domain expertise. As our models grow in sophistication, however, the distinction between these two components is not always clear, and so it is prudent to focus on building the full Bayesian joint distribution.
When first beginning the model development process our initial model should be as simple as possible, perhaps ignoring many if not all of the systematic effects of the environment in which the system of interest is contained. The initial model should be just sophisticated enough to be able to answer the scientific questions of interest in an idealized experiment.
The observational model is built from many probability distributions over the observation space, each defining a mathematical narrative of how observations are generated. This collection of data generating processes should translate the conceptual narrative produced in Step One into a more rigorous probabilistic model.
Often the observational model is a crude approximation to the complexity of the true measurement process, but even crude approximations can be sufficient to answer sophisticated questions in practice. Still, we want to keep the approximate nature of our observational model, in particular the natural extensions of the model to include for example heterogeneities, time dependences, and other systematic effects, in mind.
Requirements: Domain Expertise and Statistical Expertise
In a Bayesian model the observational model is complemented with a prior distribution over the space of all unknown model configuration parameters, \(\pi_{\mathcal{S}}(\theta)\). The prior distribution does not need to incorporate all of our available domain expertise, just enough to ensure that the joint model, \[ \pi_{\mathcal{S}}(y, \theta) = \pi_{\mathcal{S}}(y \mid \theta) \, \pi_{\mathcal{S}}(\theta), \] is well-behaved (Gelman, Simpson, and Betancourt 2017). In practice it is often sufficient for the prior distribution to incorporate information about the scales, units, or orders of magnitude which identify unreasonable but not impossible model configurations.
Keep in mind that the separation of the Bayesian model into an observation model and a prior model is not uniquely defined. Consider, for example, unobserved latent structure, \(\pi_{\mathcal{S}}(\theta \mid \phi)\), such as that arising in hierarchical models and hidden Markov models. We could equally well define the observation model as \(\pi_{\mathcal{S}}(y \mid \theta) \, \pi_{\mathcal{S}}(\theta \mid \phi)\) with the corresponding prior distribution \(\pi_{\mathcal{S}}(\phi)\), or the observation model as \(\pi_{\mathcal{S}}(y \mid \theta)\) with the prior distribution \(\pi_{\mathcal{S}}(\theta \mid \phi) \, \pi_{\mathcal{S}}(\phi)\).
Which is more appropriate in a given application depends on what parameters one expects to vary in the generation of new observations.
Consequently the focus should be not on specifying prior distributions in isolation but rather completing the observation model developed above with necessary domain expertise about the model configurations. This might take the form of independent, one-dimensional prior distributions for each parameter or more sophisticated prior distributions that incorporate multiple parameters. These joint priors often introduce their own parameters that require another round of prior distributions. When building sophisticated models it’s probability distributions all the way down.
Requirements: Domain Expertise and Statistical Expertise
Once we have developed an explicit model we can reevaluate the context of the aspirational model and modify existing summary statistics or introduce new ones.
If the current model is an expansion from a previous model then we can also take this opportunity to exploit any information from failures of the previous model or information about added features to motivate new summary statistics. For example, if we add a new systematic effect then we might also introduce summary statistics that are sensitive to the heterogeneity of that systematic effect across configurations of the experiment, such as which individual collecting the data or at which site the data were collected.
Before considering any observed data we first want to understand the consequences of our modeling assumptions by answering Question One, Question Two, and Question Three.
Fortunately this analysis is readily performed with samples of reasonable model configurations and observations from the joint distribution, \(\pi_{\mathcal{S}}(y, \theta)\). Typically we first simulate model configurations from the prior distribution and then observations from the corresponding data generating process, \[ \tilde{\theta} \sim \pi_{\mathcal{S}}(\theta) \\ \tilde{y} \sim \pi_{\mathcal{S}}(y \mid \tilde{\theta} ), \] although when the separation between observational and prior models isn’t so clear we can consider any means of sampling from the joint distribution.
Requirements: Statistical Expertise
Answers to Question One are readily informed with prior predictive checks of the summary statistics constructed in Step Three or Step Five.
If the prior predictive checks indicate conflict between the model and our domain expertise then we have to return to Step Four and modify our model, typically by incorporating additional domain expertise to better constrain the prior distribution.
Requirements: Domain Expertise
For each simulated observation we can also construct a posterior distribution with a given computational method, \(\pi_{\mathcal{S}}(\theta \mid \tilde{y})\), or at least attempt as much. In particular, the distribution of reconstructed posteriors allows us to evaluate the accuracy of the computational method, answering Question Two.
If the designated computational method features diagnostics, such as \(\hat{R}\) for Markov chain Monte Carlo in general and divergences for Hamiltonian Monte Carlo in particular, then we can look for failed diagnostics across the distribution of posterior fits. Failures indicate that we need to consider improving the tuning or implementation of the computational method, or selecting another method altogether, and then repeating the fits.
Correlating failed diagnostics with the corresponding simulated model configurations can also help to identify regions of the model configuration space that may be ill-posed for practical computation. Moreover, the emergence of these pathological fits implies extreme behavior in the posterior which often hints at tension between the model and our domain expertise that may require reevaluating Step Four.
Requirements: Statistical Expertise
Even if our chosen computational method doesn’t have its own diagnostics, we can evaluate its performance by performing simulation-based calibration over the joint sample. As above, if the performance of the method is unsatisfactory then we need to consider retuning the method, or selecting another method altogether and then repeating the fits.
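To illustrate the mechanics of simulation-based calibration, here is a minimal sketch using a conjugate normal-normal model whose posterior is available in closed form; this is a stand-in model for illustration only, not the model developed in this case study, and all names are hypothetical:

```r
# Simulation-based calibration sketch with a conjugate normal-normal model:
# if theta ~ N(0, 1) and y ~ N(theta, 1) then the exact posterior is
# N(y / 2, 1 / sqrt(2)).
set.seed(8675309)

R <- 1000   # number of replications
L <- 100    # posterior draws per replication

ranks <- sapply(1:R, function(r) {
  theta_tilde <- rnorm(1, 0, 1)                      # draw from the prior
  y_tilde <- rnorm(1, theta_tilde, 1)                # simulate an observation
  post_draws <- rnorm(L, y_tilde / 2, 1 / sqrt(2))   # exact posterior draws
  sum(post_draws < theta_tilde)                      # rank of the prior draw
})

# If the posterior draws are faithful then the ranks are uniform over
# {0, ..., L}, which a rank histogram would exhibit as roughly equal bin counts
```

Because the posterior here is exact, the resulting rank histogram is consistent with uniformity; a miscalibrated computational method would instead produce systematic deviations such as sloping or cupped histograms.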
Requirements: Statistical Expertise
Once we trust the faithfulness of our reconstructed posterior distributions we can consider the breadth of inferences they provide to answer Question Three. In general we can always consider the distribution of posterior \(z\)-scores against posterior shrinkages to identify common pathologies in our inferences.
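Concretely, writing \(\mu_{\mathrm{post}}\) and \(\sigma_{\mathrm{post}}\) for the mean and standard deviation of the posterior fit to a simulated observation, \(\tilde{\theta}\) for the simulated ground truth, and \(\sigma_{\mathrm{prior}}\) for the prior standard deviation, these two summaries are
\[
z = \frac{ \mu_{\mathrm{post}} - \tilde{\theta} }{ \sigma_{\mathrm{post}} },
\quad
s = 1 - \frac{ \sigma_{\mathrm{post}}^{2} }{ \sigma_{\mathrm{prior}}^{2} }.
\]
A small \(z\)-score indicates that the posterior concentrates around the simulated ground truth, while a shrinkage near one indicates that the posterior is much narrower than the prior.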
If there are indications of undesired behavior, such as overfitting or non-identifiability, then we might need to return to Step One to reconsider an improved experimental design or tempered scientific goals.
Sometimes we may need only return to Step Four to incorporate additional domain expertise that stabilizes our inferences.
Requirements: Statistical Expertise
If a decision-making process has been established then we can also calibrate the performance of this process by constructing the distribution of utilities over the joint distribution. For example, if a discovery claim is to be made then we at the very least want to compute the corresponding false discovery rates and true discovery rates within the context of our model.
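As a sketch of such a calibration, consider again a stand-in conjugate normal-normal model with a hypothetical discovery rule; the model, rule, and threshold here are all assumptions for illustration:

```r
# Calibrate a discovery rule over the joint ensemble, sketched with a
# conjugate normal-normal model: theta ~ N(0, 1), y ~ N(theta, 1), so the
# exact posterior is N(y / 2, 1 / sqrt(2)).
set.seed(2718)

R <- 100000
theta <- rnorm(R, 0, 1)     # simulated ground truths
y <- rnorm(R, theta, 1)     # simulated observations

post_mean <- y / 2
post_sd <- 1 / sqrt(2)

# Hypothetical rule: claim a discovery when P(theta > 0 | y) exceeds 0.95
claim <- pnorm(0, post_mean, post_sd, lower.tail=FALSE) > 0.95

false_discovery_rate <- mean(theta[claim] <= 0)   # fraction of claims that are wrong
true_discovery_rate <- mean(claim[theta > 0])     # fraction of positives that are claimed
```

Within the scope of this stand-in model the false discovery rate stays below the posterior tail threshold of five percent, while the true discovery rate reveals how conservative the rule is for the available data.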
Once again, if the calibration proves unsatisfactory then we will need to return to either Step One or Step Four.
Requirements: Domain Expertise and Statistical Expertise
If Question One, Question Two, and Question Three have been satisfactorily answered within the scope of the model then we can proceed to fitting the observed data and confront Question Four. Once we consider the observed data we introduce a vulnerability to overfitting, and so we must be extremely vigilant in studying the behavior of any expanded models in subsequent iterations of the workflow.
Confident in our model we can construct a posterior distribution from the observed data. We have to be careful, however, as the previous validation of our chosen computational method may not apply if the observed data is not within the scope of the prior predictive distribution, which can happen when the model is too simple to capture the relevant structure of the true data generating process. Consequently we need to be careful to check any diagnostics of our method on the fit of the observed data.
If any diagnostics indicate poor performance then not only is our computational method suspect but also our model may not be rich enough to capture the relevant details of the observed data. At the very least we should reconsider the tuning of our computational tool, but we may also want to consider a deeper investigation into the poor performance to better identify the source of the pathology. This analysis can suggest limitations of our model which can motivate improvements in Step Four.
Requirements: Statistical Expertise
When we construct a posterior distribution for the observed data we can also perform a posterior predictive check for each of our summary statistics to answer Question Four. If there are indications of problems then we need to go back to Step Four and expand the model as motivated by the interpretation of the summary statistic. We may even need to go back to Step One and improve our domain expertise of the experimental design before considering an expanded model. Once we have a new model we then restart the workflow and iterate until we can’t identify any substantial deviations between the observed data and our fit.
In order to avoid biasing our model development by overfitting to the observed data, we have to be careful to not abuse the particular realization of the observed data. Specifically we don’t want to respond to failing posterior predictive checks by incorporating quantitative features of the observed data into the model but rather by expanding the model within the scope of our, possibly improved, domain expertise. Failures of posterior predictive checks should motivate expansions of the model but not particular model configurations within that expanded model!
For example, we don’t want to respond to a failing posterior predictive check by adding a new component to our model with particular values to match the observed data but rather a new component with uncertain values that are instead fit jointly with the rest of the model parameters. The prior distribution for this new component should not be informed by the observed data but rather our domain expertise.
Requirements: Domain Expertise and Statistical Expertise
In order to demonstrate this proposed workflow let’s consider a relatively simple example of an analysis that might arise in any science lab. We have been tasked with analyzing data collected by a precocious young student who was instructed to train an array of detectors onto a radioactive specimen that emits a constant flow of particles. They dutifully collected data and collated it into a file which has arrived in our inbox.
Before even reading the email we dutifully set up our local R environment.
library(rstan)
Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)

rstan_options(auto_write = TRUE)

library(foreach)
Warning: package 'foreach' was built under R version 3.4.3

library(doParallel)
Loading required package: iterators
Warning: package 'iterators' was built under R version 3.4.3
Loading required package: parallel
util <- new.env()
source('stan_utility.R', local=util)
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")

In our first iteration through the workflow we model the experiment that we’ve been told occurred. Nothing ominous here.
Our conceptual model follows the instructions given to the student. 1000 detectors were directed towards the source and set to record the number of incident particles over a given interval of time. The intensity of the source is presumed to be constant over that interval and each detector should be identical in operation.
Mathematically our observation takes the form of integer counts, \(y\), for each of the \(N = 1000\) detectors. In the Stan modeling language this is specified as
writeLines(readLines("fit_data.stan", n=4))

data {
int N; // Number of observations
int y[N]; // Count at each observation
}
There are \(N = 1000\) components in each observation, one for each detector. We could analyze each component independently but, because we have assumed that the detectors are all identical, we can analyze their comprehensive responses with a histogram of their counts. In other words we consider the histogram of detector counts as the summary statistic!
It’s always difficult to create examples that incorporate domain expertise as the presentation of domain expertise that resonates with a given practitioner will vary from field to field. For this example we’ll fake it by explicitly decreeing the available expertise. In particular, we assume that our conceptual domain expertise would inform us that 25 counts in a detector would be an extreme but not impossible observation for this kind of detector.
The constant intensity of the source and detector response suggest a Poisson observation model for each of the detectors with a single source strength, \(\lambda\).
Our domain expertise that 25 counts is extreme suggests that we want our prior for \(\lambda\) to keep most of its probability mass below \(\lambda = 15\), which corresponds to fluctuations in the observations around \(15 + 3 \sqrt{15} \approx 25\). We achieve this with a half-normal prior with standard deviation = 6.44787 such that only 1% of the prior probability mass is above \(\lambda = 15\).
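As a quick check on that scale, a computation the text itself doesn’t show, 6.44787 is what one obtains by placing the 99th percentile of the unfolded normal at \(\lambda = 15\):

```r
# Reproduce the prior scale: solve 1 - pnorm(15 / sigma) = 0.01 for sigma,
# i.e. place the normal 99th percentile at lambda = 15 before folding.
sigma <- 15 / qnorm(0.99)
round(sigma, 5)   # 6.44787
```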
lambda <- seq(0, 20, 0.001)
plot(lambda, dnorm(lambda, 0, 6.44787), type="l", col=c_dark_highlight, lwd=2,
xlab="lambda", ylab="Prior Density", yaxt='n')
lambda99 <- seq(0, 15, 0.001)
dens <- dnorm(lambda99, 0, 6.44787)
lambda99 <- c(lambda99, 15, 0)
dens <- c(dens, 0, 0)
polygon(lambda99, dens, col=c_dark, border=NA)

This probabilistic model is implemented in the Stan programs
writeLines(readLines("sample_joint_ensemble.stan"))

data {
int N;
}
generated quantities {
// Simulate model configuration from prior model
real<lower=0> lambda = fabs(normal_rng(0, 6.44787));
// Simulate data from observational model
int y[N];
for (n in 1:N) y[n] = poisson_rng(lambda);
}
writeLines(readLines("fit_data.stan"))

data {
int N; // Number of observations
int y[N]; // Count at each observation
}
parameters {
real<lower=0> lambda; // Poisson intensity
}
model {
lambda ~ normal(0, 6.44787); // Prior model
y ~ poisson(lambda); // Observational model
}
The independent, identically distributed Poisson model is well suited to the histogram summary statistic that we’ve already introduced.
Were this model more complex then we might consider summary statistics that exploit the specific structure of the model. For example, were there variation in the Poisson intensity across groups we might consider separate histograms for the data arising from each of those groups.
Here we will consider \(R = 1000\) draws from the joint ensemble, each of which simulates a ground truth and observed values for the \(N = 1000\) detectors. Note the use of the Fixed_param algorithm in Stan which only evaluates the generated quantities block at each iteration.
R <- 1000
N <- 1000
simu_data <- list("N" = N)
fit <- stan(file='sample_joint_ensemble.stan', data=simu_data,
iter=R, warmup=0, chains=1, refresh=R,
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'sample_joint_ensemble' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.102751 seconds (Sampling)
0.102751 seconds (Total)
simu_lambdas <- extract(fit)$lambda
simu_ys <- extract(fit)$y

More draws from the joint ensemble will yield a more precise understanding of the model and I recommend that you run as many replications as your computational resources allow. In circumstances with limited computational resources even a few replications can provide a powerful view into the consequences of your model.
Each simulated observation in the joint ensemble gives a summary histogram. For example, one sample gives the histogram
B <- 40
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
counts <- sapply(1:R, function(r) hist(simu_ys[r,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
pad_counts <- sapply(1:R, function(r) do.call(cbind, lapply(idx, function(n) counts[n + 1, r])))
c_superfine <- c("#8F272755")
plot(1, type="n", main="Prior Predictive Histogram Samples",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, 500), ylab="")
lines(x, pad_counts[,1], col=c_superfine, lw=2)

Three samples give the three histograms
plot(1, type="n", main="Prior Predictive Histogram Samples",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, 500), ylab="")
lines(x, pad_counts[,1], col=c_superfine, lw=2)
lines(x, pad_counts[,2], col=c_superfine, lw=2)
lines(x, pad_counts[,3], col=c_superfine, lw=2)

and ten samples give the ten histograms
plot(1, type="n", main="Prior Predictive Histogram Samples",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, 500), ylab="")
for (r in 1:10)
lines(x, pad_counts[,r], col=c_superfine, lw=2)

We can visualize the entire prior predictive distribution of these histograms with quantiles of the bin counts. Here the darkest red corresponds to the bin medians with the increasingly lighter bands corresponding to \((0.4, 0.6)\), \((0.3, 0.7)\), \((0.2, 0.8)\), and \((0.1, 0.9)\) intervals, respectively. This visualization can obscure correlations amongst the bins, but it serves as a useful summary of the full prior predictive distribution.
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Prior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, max(cred[9,])), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
abline(v=25, col="white", lty=1, lw=2.5)
abline(v=25, col="black", lty=1, lw=2)

We see a very small prior predictive probability for observations above the extreme scale informed by our domain expertise, which is corroborated by the estimated tail probability,
length(simu_ys[simu_ys > 25]) / length(simu_ys)
[1] 0.000933
Now we are ready to fit posteriors and draw inferences from each of our replications. Because each replication can be analyzed independently this analysis is embarrassingly parallel which allows us to exploit the increasingly common parallel computing resources.
Because this particular model can be fit so quickly I am parallelizing over replications and not parallelizing the four Markov chains run to fit each replication to which Stan defaults. The optimal setup in any given analysis will depend on the details of the computational circumstances.
For each fit we capture the fit diagnostics, the simulation-based calibration rank, and the posterior \(z\)-score and posterior shrinkage.
tryCatch({
registerDoParallel(makeCluster(detectCores()))
simu_list <- t(data.matrix(data.frame(simu_lambdas, simu_ys)))
# Compile the posterior fit model
fit_model = stan_model(file='fit_data.stan')
ensemble_output <- foreach(simu=simu_list,
.combine='cbind') %dopar% {
simu_lambda <- simu[1]
simu_y <- simu[2:(N + 1)];
# Fit the simulated observation
input_data <- list("N" = N, "y" = simu_y)
capture.output(library(rstan))
capture.output(fit <- sampling(fit_model, data=input_data, seed=4938483))
# Compute diagnostics
util <- new.env()
source('stan_utility.R', local=util)
warning_code <- util$check_all_diagnostics(fit, quiet=TRUE)
# Compute rank of prior draw with respect to thinned posterior draws
sbc_rank <- sum(simu_lambda < extract(fit)$lambda[seq(1, 4000 - 8, 8)])
# Compute posterior sensitivities
s <- summary(fit, probs = c(), pars='lambda')$summary
post_mean_lambda <- s[,1]
post_sd_lambda <- s[,3]
prior_sd_lambda <- 6.44787
z_score <- (post_mean_lambda - simu_lambda) / post_sd_lambda
shrinkage <- 1 - (post_sd_lambda / prior_sd_lambda)**2
c(warning_code, sbc_rank, z_score, shrinkage)
}
}, finally={ stopImplicitCluster() })

The first thing we do is check the diagnostics to make sure that each fit is accurately representing the corresponding posterior distribution.
warning_code <- ensemble_output[1,]
if (sum(warning_code) != 0) {
print ("Some simulated posterior fits in the joint ensemble encountered problems!")
for (r in 1:R) {
if (warning_code[r] != 0) {
print(sprintf('Replication %s of %s', r, R))
util$parse_warning_code(warning_code[r])
print(sprintf('Simulated lambda = %s', simu_lambdas[r]))
print(" ")
}
}
} else {
print ("No posterior fits in the joint ensemble encountered problems!")
}
[1] "No posterior fits in the joint ensemble encountered problems!"
Fortunately there are no signs of any problems.
We can provide an even stronger guarantee that our fits are accurate by applying simulation-based calibration.
sbc_rank <- ensemble_output[2,]
sbc_hist <- hist(sbc_rank, seq(0, 500, 25) - 0.5,
                 col=c_dark, border=c_dark_highlight, plot=FALSE)

Warning in hist.default(sbc_rank, seq(0, 500, 25) - 0.5, col = c_dark,
border = c_dark_highlight, : arguments 'col', 'border' are not made use of
plot(sbc_hist, main="", xlab="Prior Rank", yaxt='n', ylab="")
low <- qbinom(0.005, R, 1 / 20)
mid <- qbinom(0.5, R, 1 / 20)
high <- qbinom(0.995, R, 1 / 20)
bar_x <- c(-10, 510, 500, 510, -10, 0, -10)
bar_y <- c(high, high, mid, low, low, mid, high)
polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=0, x1=500, y0=mid, y1=mid, col=c("#999999"), lwd=2)
plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)

The variations in the rank histogram are within the expectations of uniformity shown in the grey bar, which suggests that the fits are indeed faithfully representing the true posterior distributions.
Confident in our fits we can then analyze the posterior distributions themselves.
z_score <- ensemble_output[3,]
shrinkage <- ensemble_output[4,]
plot(shrinkage, z_score, col=c("#8F272720"), lwd=2, pch=16, cex=0.8,
xlim=c(0, 1), xlab="Posterior Shrinkage",
ylim=c(-5, 5), ylab="Posterior z-Score")

The fitted posteriors concentrate towards large shrinkages which indicates that all of our posteriors in the joint ensemble are well-identified. Moreover, the concentration towards small posterior \(z\)-scores indicates that they also accurately capture the simulated model configurations.
Without an explicit decision or precise scientific question in mind we don’t have any bespoke sensitivity analyses to perform. This would be different if we, for example, were interested in identifying differences between the detectors or subtle phenomena in the source being observed.
Confident that our model captures our domain expertise and is well-behaved within the scope of its own assumptions we can fire up our email client and grab the student’s data.
input_data <- read_rdump('workflow.data.R')
fit <- stan(file='fit_data_ppc.stan', data=input_data,
seed=4938483, refresh=2000)
SAMPLING FOR MODEL 'fit_data_ppc' NOW (CHAIN 1).
Gradient evaluation took 2.5e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 0.163745 seconds (Warm-up)
0.170782 seconds (Sampling)
0.334527 seconds (Total)
SAMPLING FOR MODEL 'fit_data_ppc' NOW (CHAIN 2).
Gradient evaluation took 1.7e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 0.166411 seconds (Warm-up)
0.158362 seconds (Sampling)
0.324773 seconds (Total)
SAMPLING FOR MODEL 'fit_data_ppc' NOW (CHAIN 3).
Gradient evaluation took 2.3e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.23 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 0.161464 seconds (Warm-up)
0.167536 seconds (Sampling)
0.329 seconds (Total)
SAMPLING FOR MODEL 'fit_data_ppc' NOW (CHAIN 4).
Gradient evaluation took 1.7e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 0.169532 seconds (Warm-up)
0.175463 seconds (Sampling)
0.344995 seconds (Total)
The fit shows no diagnostic problems,
util$check_all_diagnostics(fit)
and the recovered posterior for the source intensity looks reasonable.
params = extract(fit)
hist(params$lambda, main="", xlab="lambda", yaxt='n', ylab="",
col=c_dark, border=c_dark_highlight)

But how well does our model capture the structure of the data? To perform a posterior predictive check we compare the observed histogram of counts to the posterior predictive distribution of counts, here visualized with marginal bin quantiles as we did for the prior predictive distribution.
B <- 30
obs_counts <- hist(input_data$y, breaks=(0:(B + 1))-0.5, plot=FALSE)$counts
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
pad_obs <- do.call(cbind, lapply(idx, function(n) obs_counts[n + 1]))
counts <- sapply(1:4000, function(n) hist(params$y_ppc[n,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Posterior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y",
ylim=c(0, max(c(obs_counts, cred[9,]))), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
lines(x, pad_obs, col="white", lty=1, lwd=2.5)
lines(x, pad_obs, col="black", lty=1, lwd=2)
Unfortunately the posterior predictive check indicates a serious excess of zeros above what we'd expect from a Poissonian observational model alone. Our model is not flexible enough to capture both the peak at zero and the bulk away from zero, instead trying to compromise between these two behaviors and failing to capture either particularly well. This experiment isn't quite as simple as it first appeared, but we can use the structure of this failure to motivate an improved model and try again.
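The nature of this compromise can be seen in a toy calculation. Suppose, purely for illustration, that the data really come from a zero-inflated Poisson with inflation probability 0.3 and intensity 5; these values are hypothetical and not fit to the actual observation. A single Poisson matching the overall mean then badly underpredicts the zero fraction:

```python
from math import exp

# Hypothetical truth: zero-inflated Poisson with inflation theta and intensity lam
theta, lam = 0.3, 5.0

# True zero fraction: inflated zeros plus the Poisson's own zeros
true_zero_frac = theta + (1 - theta) * exp(-lam)

# A single Poisson matching the mean ends up with lambda_hat = (1 - theta) * lam
lam_hat = (1 - theta) * lam
poisson_zero_frac = exp(-lam_hat)

print(round(true_zero_frac, 3))     # ~0.305: plenty of zeros
print(round(poisson_zero_frac, 3))  # ~0.030: an order of magnitude too few
```

No single value of the Poisson intensity can reproduce both the zero peak and the location of the bulk at the same time.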
The conflict between the observation and the model prediction we saw in the posterior predictive check immediately suggests an expanded model where we introduce a second source of zeroes, and a second iteration of our workflow.
Because we hadn’t previously considered a source of excess zeroes we should return all the way back to the conceptual analysis in order to understand the possible mechanisms for these zeroes in our measurement process. The excess suggests that not all of the detectors are in perfect working order, returning zero counts regardless of the flux of incident radiation. Upon further consideration this isn’t unreasonable for detectors that a student excavated from a dusty closet.
The observation space does not change in our expanded model, just the probabilistic structure of our possible data generating processes.
The same histogram summary will be useful in summarizing the observations.
We can readily capture this possibility of failing detectors with a zero-inflated Poisson model that mixes a Poisson distribution with a point distribution that concentrates entirely at zero. We use the same prior for the source intensity, \(\lambda\), and assign a uniform prior over the mixture weight, \(\theta\).
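Concretely, the zero-inflated Poisson assigns each count the probability \(p(y) = \theta \, [y = 0] + (1 - \theta) \, \text{Poisson}(y \mid \lambda)\). A minimal sketch of this probability mass function, with illustrative parameter values:

```python
from math import exp, factorial

def zip_pmf(y, theta, lam):
    """Zero-inflated Poisson: mix a point mass at zero with a Poisson."""
    poisson = exp(-lam) * lam**y / factorial(y)
    return theta * (1 if y == 0 else 0) + (1 - theta) * poisson

theta, lam = 0.2, 4.0

# The mixture is still a proper distribution over the non-negative integers
total = sum(zip_pmf(y, theta, lam) for y in range(60))
print(round(total, 6))  # 1.0
```

The point mass only ever contributes at \(y = 0\), which is why the Stan implementation below needs the mixture term only in the `y[n] == 0` branch.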
This expanded model is implemented in the Stan programs
writeLines(readLines("sample_joint_ensemble2.stan"))
data {
int N;
}
generated quantities {
// Simulate model configuration from prior model
real<lower=0, upper=1> theta = beta_rng(1, 1);
real<lower=0> lambda = fabs(normal_rng(0, 6.44787));
// Simulate data from observational model
int y[N] = rep_array(0, N);
for (n in 1:N)
if (!bernoulli_rng(theta))
y[n] = poisson_rng(lambda);
}
writeLines(readLines("fit_data2.stan"))
data {
int N; // Number of observations
int y[N]; // Count at each observation
}
parameters {
real<lower=0, upper=1> theta; // Excess zero probability
real<lower=0> lambda; // Poisson intensity
}
model {
// Prior model
theta ~ beta(1, 1);
lambda ~ normal(0, 6.44787);
// Observational model that mixes a Poisson with excess zeros
for (n in 1:N) {
real lpdf = poisson_lpmf(y[n] | lambda);
if (y[n] == 0)
target += log_mix(theta, 0, lpdf);
else
target += log(1 - theta) + lpdf;
}
}
Because zero-inflation manifests in the histogram summary statistic we don't need a new summary statistic to isolate this behavior. If we wanted to be exhaustive, however, we could consider new summary statistics such as the number of observed zeroes or the ratio of observed zeroes to total observations.
The structure of our joint ensemble doesn’t change.
R <- 1000
N <- 1000
simu_data <- list("N" = N)
fit <- stan(file='sample_joint_ensemble2.stan', data=simu_data,
iter=R, warmup=0, chains=1, refresh=R,
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'sample_joint_ensemble2' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.080365 seconds (Sampling)
0.080365 seconds (Total)
simu_lambdas <- extract(fit)$lambda
simu_thetas <- extract(fit)$theta
simu_ys <- extract(fit)$y
The prior predictive distribution of histograms now manifests the zero-inflation we have added to our model.
B <- 50
counts <- sapply(1:R, function(r) hist(simu_ys[r,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Prior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, max(cred[9,])), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
abline(v=25, col="white", lty=1, lwd=2.5)
abline(v=25, col="black", lty=1, lwd=2)
This addition, however, doesn't change the tail of observations and hence we still have an appropriately small prior predictive probability for extreme counts.
length(simu_ys[simu_ys > 25]) / length(simu_ys)
[1] 0.000231
There are fewer extreme counts than before because the zero inflation suppresses the relative number of non-zero observations.
We proceed to fitting each replication, being sure to capture simulation-based calibration ranks and posterior summaries for the new parameter.
tryCatch({
registerDoParallel(makeCluster(detectCores()))
simu_list <- t(data.matrix(data.frame(simu_lambdas, simu_thetas, simu_ys)))
# Compile the posterior fit model
fit_model = stan_model(file='fit_data2.stan')
ensemble_output <- foreach(simu=simu_list,
.combine='cbind') %dopar% {
simu_lambda <- simu[1]
simu_theta <- simu[2]
simu_y <- simu[3:(N + 2)];
# Fit the simulated observation
input_data <- list("N" = N, "y" = simu_y)
capture.output(library(rstan))
capture.output(fit <- sampling(fit_model, data=input_data, seed=4938483))
# Compute diagnostics
util <- new.env()
source('stan_utility.R', local=util)
warning_code <- util$check_all_diagnostics(fit, quiet=TRUE)
# Compute rank of prior draw with respect to thinned posterior draws
sbc_rank_lambda <- sum(simu_lambda < extract(fit)$lambda[seq(1, 4000 - 8, 8)])
sbc_rank_theta <- sum(simu_theta < extract(fit)$theta[seq(1, 4000 - 8, 8)])
# Compute posterior sensitivities
s <- summary(fit, probs = c(), pars='lambda')$summary
post_mean_lambda <- s[,1]
post_sd_lambda <- s[,3]
prior_sd_lambda <- 6.44787
z_score_lambda <- (post_mean_lambda - simu_lambda) / post_sd_lambda
shrinkage_lambda <- 1 - (post_sd_lambda / prior_sd_lambda)**2
s <- summary(fit, probs = c(), pars='theta')$summary
post_mean_theta <- s[,1]
post_sd_theta <- s[,3]
prior_sd_theta <- sqrt(1.0 / 12)
z_score_theta <- (post_mean_theta - simu_theta) / post_sd_theta
shrinkage_theta <- 1 - (post_sd_theta / prior_sd_theta)**2
c(warning_code,
sbc_rank_lambda, z_score_lambda, shrinkage_lambda,
sbc_rank_theta, z_score_theta, shrinkage_theta)
}
}, finally={ stopImplicitCluster() })
First things first, we check the fit diagnostics for each replication.
warning_code <- ensemble_output[1,]
if (sum(warning_code) != 0) {
print ("Some simulated posterior fits in the joint ensemble encountered problems!")
for (r in 1:R) {
if (warning_code[r] != 0) {
print(sprintf('Replication %s of %s', r, R))
util$parse_warning_code(warning_code[r])
print(sprintf('Simulated lambda = %s', simu_lambdas[r]))
print(sprintf('Simulated theta = %s', simu_thetas[r]))
print(" ")
}
}
} else {
print ("No posterior fits in the joint ensemble encountered problems!")
}
[1] "Some simulated posterior fits in the joint ensemble encountered problems!"
[1] "Replication 228 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 4.16358468916065"
[1] "Simulated theta = 0.999066957140718"
[1] " "
[1] "Replication 232 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.00254846266903879"
[1] "Simulated theta = 0.725747877279408"
[1] " "
[1] "Replication 269 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.0269885176204752"
[1] "Simulated theta = 0.415686660992948"
[1] " "
[1] "Replication 478 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 2.70935972544222"
[1] "Simulated theta = 0.00101399446718843"
[1] " "
[1] "Replication 531 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.947789512750722"
[1] "Simulated theta = 0.0849450815790578"
[1] " "
[1] "Replication 535 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.476598090298011"
[1] "Simulated theta = 0.0437931486024614"
[1] " "
[1] "Replication 603 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.0289573160402738"
[1] "Simulated theta = 0.971606827223978"
[1] " "
[1] "Replication 671 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.0789541827011773"
[1] "Simulated theta = 0.646975431309257"
[1] " "
[1] "Replication 941 of 1000"
[1] "treedepth warning"
[1] "Simulated lambda = 13.3165181177665"
[1] "Simulated theta = 0.997040220734551"
[1] " "
[1] "Replication 964 of 1000"
[1] "divergence warning"
[1] "Simulated lambda = 0.0053831886489381"
[1] "Simulated theta = 0.876539739817126"
[1] " "
Unfortunately our computed fits are nowhere near as reliable for this expanded model! Indeed if we carefully analyze the simulated model configurations for each of the problematic fits we notice that problems arise whenever \(\lambda\) is small or \(\theta\) is large. That seems a little suspicious, no?
It should, because this is a manifestation of a non-identifiability hiding in our zero-inflated model. When the Poisson source strength is small enough the resulting distribution will be indistinguishable from the distribution concentrated at zero that we are using to model the excess zeroes.
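We can sketch this degeneracy directly. For an observation consisting entirely of zeros, the zero-inflated likelihood depends on \((\theta, \lambda)\) only through the total zero probability \(\theta + (1 - \theta) e^{-\lambda}\), so very different model configurations yield nearly identical log likelihoods. The numbers below are illustrative, not taken from the actual fits:

```python
from math import exp, log

def zip_loglik_zeros(n, theta, lam):
    """Log likelihood of n zero counts under a zero-inflated Poisson:
    each zero has probability theta + (1 - theta) * exp(-lam)."""
    return n * log(theta + (1 - theta) * exp(-lam))

# Two wildly different explanations of 100 observed zeros:
# mostly-broken detectors with a strong source ...
ll_broken = zip_loglik_zeros(100, theta=0.9, lam=10.0)
# ... or fully working detectors with a vanishingly weak source
ll_weak = zip_loglik_zeros(100, theta=0.0, lam=-log(0.9))

print(abs(ll_broken - ll_weak))  # ~0.0005: practically indistinguishable
```

Along this ridge the posterior density is nearly flat, which is exactly the kind of geometry that provokes divergences and treedepth warnings in Hamiltonian Monte Carlo.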
While we could try to reconfigure our computational method to better handle this non-identifiability, its manifestation provides us with an opportunity to reconsider our domain expertise and tweak our model. Is it reasonable for the source intensity to be small enough that we would expect mostly zeroes from the Poisson model?
Because computational problems correlate with extreme posterior behavior, they have a knack for identifying the deficiency of the domain expertise we’ve incorporated into our model. We should heed their warnings carefully!
Taking a deep breath we push on with our third iteration of the model building workflow.
Just how reasonable is it for a working detector aimed at a well-known source to admit a small Poisson intensity, or for essentially all of the detectors to be working, or essentially all to be failing? The computational problems above identified some important questions that we hadn't addressed with our domain expertise.
Upon reflection let’s say that our domain expertise specifies that the source we’re studying has been previously calibrated to test detectors like these so that zero observations are rare for a working detector exposed to the source for as long as ours were. Moreover, while some of the detectors may be flakey, a bunch were recently used and so not all of them should be failing. While our observational model will not change, these recognitions imply more informative priors for the zero-inflated model. Working detectors and malfunctioning detectors should be distinguishable!
Importantly these insights were not learned from any observations. They come from domain expertise that has always existed, if not within us then within our colleagues, but had not been incorporated into our model. Only when computational problems identified pathological behaviors did we reevaluate and expand our domain expertise.
The observation space remains the same.
As does the utility of the histogram summary.
We need to introduce prior distributions that incorporate enough of our newly considered domain expertise to separate the behavior of the working detectors and the malfunctioning detectors. At the same time we have to be careful to avoid using the observed data to inform these prior distributions lest we introduce the potential for overfitting to that observation. Instead we want to use the tension in the posterior predictive check to suggest an additional model component, here the zero-inflation, and then use our domain expertise to motivate appropriate prior distributions independent of the details of the observation itself.
We can tame the non-identifiability that is inconsistent with our domain expertise by utilizing an inverse Gamma distribution that places only 1% probability below \(\lambda = 1\), where the Poisson distribution starts to look like the distribution concentrated at zero that we’re using to model the malfunctioning detectors. At the same time we want to maintain 1% of the prior probability above \(\lambda = 15\).
lambda <- seq(0, 20, 0.001)
dens <- sapply(lambda, function(l) dgamma(1 / l, 3.48681, rate=9.21604) * (1 / l**2))
plot(lambda, dens, type="l", col=c_dark_highlight, lwd=2,
xlab="lambda", ylab="Prior Density", yaxt='n')
lambda98 <- seq(1, 15, 0.001)
dens <- sapply(lambda98, function(l) dgamma(1 / l, 3.48681, rate=9.21604) * (1 / l**2))
lambda98 <- c(lambda98, 15, 1)
dens <- c(dens, 0, 0)
polygon(lambda98, dens, col=c_dark, border=NA)
Additionally, let's take a prior distribution for the zero mixture probability, \(\theta\), that puts only 1% probability mass below 0.1 and above 0.9 to capture our domain knowledge that some detectors are expected to fail but not all of them. Again, we want to be careful to not use the exact number of zeros in the observed data to inform this prior distribution.
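Both prior tail conditions can be checked numerically. A sketch assuming `scipy` is available; the shape and scale values are the ones used in the Stan programs below:

```python
from scipy.stats import beta, invgamma

# Inverse gamma prior for lambda: roughly 1% mass below 1 and above 15
lam_prior = invgamma(3.48681, scale=9.21604)
print(lam_prior.cdf(1.0))   # approximately 0.01
print(lam_prior.sf(15.0))   # approximately 0.01

# Symmetric beta prior for theta: roughly 1% mass below 0.1 and above 0.9
theta_prior = beta(2.8663, 2.8663)
print(theta_prior.cdf(0.1))  # approximately 0.01
print(theta_prior.sf(0.9))   # approximately 0.01
```

In practice these shape parameters are found by numerically solving for the desired tail probabilities, for example with Stan's algebra solver or a root finder.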
theta <- seq(0, 1, 0.001)
dens <- dbeta(theta, 2.8663, 2.8663)
plot(theta, dens, type="l", col=c_dark_highlight, lwd=2,
xlab="theta", ylab="Prior Density", yaxt='n')
theta98 <- seq(0.1, 0.9, 0.001)
dens <- dbeta(theta98, 2.8663, 2.8663)
theta98 <- c(theta98, 0.9, 0.1)
dens <- c(dens, 0, 0)
polygon(theta98, dens, col=c_dark, border=NA)
The new model is implemented in the Stan programs
writeLines(readLines("sample_joint_ensemble3.stan"))
data {
int N;
}
generated quantities {
// Simulate model configuration from prior model
real<lower=0, upper=1> theta = beta_rng(2.8663, 2.8663);
real<lower=0> lambda = inv_gamma_rng(3.48681, 9.21604);
// Simulate data from observational model
int y[N] = rep_array(0, N);
for (n in 1:N)
if (!bernoulli_rng(theta))
y[n] = poisson_rng(lambda);
}
writeLines(readLines("fit_data3.stan"))
data {
int N; // Number of observations
int y[N]; // Count at each observation
}
parameters {
real<lower=0, upper=1> theta; // Excess zero probability
real<lower=0> lambda; // Poisson intensity
}
model {
// Prior model
theta ~ beta(2.8663, 2.8663);
lambda ~ inv_gamma(3.48681, 9.21604);
// Observational model that mixes a Poisson with excess zeros
for (n in 1:N) {
real lpdf = poisson_lpmf(y[n] | lambda);
if (y[n] == 0)
target += log_mix(theta, 0, lpdf);
else
target += log(1 - theta) + lpdf;
}
}
The histogram summary statistic is still capable of capturing the relevant structure of the observations.
The joint ensemble is generated as before.
R <- 1000
N <- 1000
simu_data <- list("N" = N)
fit <- stan(file='sample_joint_ensemble3.stan', data=simu_data,
iter=R, warmup=0, chains=1, refresh=R,
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'sample_joint_ensemble3' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.090386 seconds (Sampling)
0.090386 seconds (Total)
simu_lambdas <- extract(fit)$lambda
simu_thetas <- extract(fit)$theta
simu_ys <- extract(fit)$y
With the modified prior distributions we see a slightly different prior predictive average histogram, but one still compatible with our domain expertise.
B <- 60
counts <- sapply(1:R, function(r) hist(simu_ys[r,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Prior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, max(cred[9,])), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
abline(v=25, col="white", lty=1, lwd=2.5)
abline(v=25, col="black", lty=1, lwd=2)
Indeed we have the desired low probability for extreme observations.
length(simu_ys[simu_ys > 25]) / length(simu_ys)
[1] 0.000965
We can now proceed to fitting each replication, updating the posterior shrinkage calculations to take into account the updated prior distributions.
tryCatch({
registerDoParallel(makeCluster(detectCores()))
simu_list <- t(data.matrix(data.frame(simu_lambdas, simu_thetas, simu_ys)))
# Compile the posterior fit model
fit_model = stan_model(file='fit_data3.stan')
ensemble_output <- foreach(simu=simu_list,
.combine='cbind') %dopar% {
simu_lambda <- simu[1]
simu_theta <- simu[2]
simu_y <- simu[3:(N + 2)];
# Fit the simulated observation
input_data <- list("N" = N, "y" = simu_y)
capture.output(library(rstan))
capture.output(fit <- sampling(fit_model, data=input_data, seed=4938483))
# Compute diagnostics
util <- new.env()
source('stan_utility.R', local=util)
warning_code <- util$check_all_diagnostics(fit, quiet=TRUE)
# Compute rank of prior draw with respect to thinned posterior draws
sbc_rank_lambda <- sum(simu_lambda < extract(fit)$lambda[seq(1, 4000 - 8, 8)])
sbc_rank_theta <- sum(simu_theta < extract(fit)$theta[seq(1, 4000 - 8, 8)])
# Compute posterior sensitivities
s <- summary(fit, probs = c(), pars='lambda')$summary
post_mean_lambda <- s[,1]
post_sd_lambda <- s[,3]
prior_sd_lambda <- sqrt( (9.21604)**2 / ((3.48681 - 1)**2 * (3.48681 - 2)) )
z_score_lambda <- (post_mean_lambda - simu_lambda) / post_sd_lambda
shrinkage_lambda <- 1 - (post_sd_lambda / prior_sd_lambda)**2
s <- summary(fit, probs = c(), pars='theta')$summary
post_mean_theta <- s[,1]
post_sd_theta <- s[,3]
prior_sd_theta <- sqrt( (2.8663)**2 / (4 * (2.8663)**2 * (2 * 2.8663 + 1)) )
z_score_theta <- (post_mean_theta - simu_theta) / post_sd_theta
shrinkage_theta <- 1 - (post_sd_theta / prior_sd_theta)**2
c(warning_code,
sbc_rank_lambda, z_score_lambda, shrinkage_lambda,
sbc_rank_theta, z_score_theta, shrinkage_theta)
}
}, finally={ stopImplicitCluster() })
With the non-identifiability tempered there are no diagnostic indications of fitting problems.
warning_code <- ensemble_output[1,]
if (sum(warning_code) != 0) {
print ("Some simulated posterior fits in the joint ensemble encountered problems!")
for (r in 1:R) {
if (warning_code[r] != 0) {
print(sprintf('Replication %s of %s', r, R))
util$parse_warning_code(warning_code[r])
print(sprintf('Simulated lambda = %s', simu_lambdas[r]))
print(sprintf('Simulated theta = %s', simu_thetas[r]))
print(" ")
}
}
} else {
print ("No posterior fits in the joint ensemble encountered problems!")
}
[1] "No posterior fits in the joint ensemble encountered problems!"
Similarly, the simulation-based calibration histograms for both \(\lambda\) and \(\theta\) show no signs of errors in our fits.
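The gray band drawn in these plots follows from the fact that, for a well-calibrated fit, each of the 20 rank-histogram bins receives a Binomial(R, 1/20) share of the R = 1000 ranks; the code below uses `qbinom` for the 99% interval, which can be sketched with only the standard library as:

```python
from math import comb

def binom_ppf(q, n, p):
    """Smallest k whose Binomial(n, p) CDF is at least q."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p)**(n - k)
        if cdf >= q:
            return k
    return n

R, bins = 1000, 20
low = binom_ppf(0.005, R, 1 / bins)
high = binom_ppf(0.995, R, 1 / bins)
print(low, high)  # the 99% band around the expected count of 50 per bin
```

Bins that systematically escape this band indicate miscalibrated posterior computation.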
sbc_rank <- ensemble_output[2,]
sbc_hist <- hist(sbc_rank, seq(0, 500, 25) - 0.5, plot=FALSE)
plot(sbc_hist, main="", xlab="Prior Rank (Lambda)", yaxt='n', ylab="")
low <- qbinom(0.005, R, 1 / 20)
mid <- qbinom(0.5, R, 1 / 20)
high <- qbinom(0.995, R, 1 / 20)
bar_x <- c(-10, 510, 500, 510, -10, 0, -10)
bar_y <- c(high, high, mid, low, low, mid, high)
polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=0, x1=500, y0=mid, y1=mid, col=c("#999999"), lwd=2)
plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)
sbc_rank <- ensemble_output[5,]
sbc_hist <- hist(sbc_rank, seq(0, 500, 25) - 0.5, plot=FALSE)
plot(sbc_hist, main="", xlab="Prior Rank (Theta)", yaxt='n', ylab="")
low <- qbinom(0.005, R, 1 / 20)
mid <- qbinom(0.5, R, 1 / 20)
high <- qbinom(0.995, R, 1 / 20)
bar_x <- c(-10, 510, 500, 510, -10, 0, -10)
bar_y <- c(high, high, mid, low, low, mid, high)
polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=0, x1=500, y0=mid, y1=mid, col=c("#999999"), lwd=2)
plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)
This leaves us to check the ensemble behavior of our recovered posterior distributions, which looks reasonable for both parameters.
z_score <- ensemble_output[3,]
shrinkage <- ensemble_output[4,]
plot(shrinkage, z_score, col=c("#8F272720"), lwd=2, pch=16, cex=0.8, main="Lambda",
xlim=c(0, 1), xlab="Posterior Shrinkage", ylim=c(-5, 5), ylab="Posterior z-Score")
z_score <- ensemble_output[6,]
shrinkage <- ensemble_output[7,]
plot(shrinkage, z_score, col=c("#8F272720"), lwd=2, pch=16, cex=0.8, main="Theta",
xlim=c(0, 1), xlab="Posterior Shrinkage", ylim=c(-5, 5), ylab="Posterior z-Score")
Confident in the performance of the expanded model within the scope of its own assumptions we go back to fit the observed data once again.
input_data <- read_rdump('workflow.data.R')
fit <- stan(file='fit_data3_ppc.stan', data=input_data,
seed=4938483, refresh=2000)
SAMPLING FOR MODEL 'fit_data3_ppc' NOW (CHAIN 1).
Gradient evaluation took 0.00038 seconds
1000 transitions using 10 leapfrog steps per transition would take 3.8 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 1.79606 seconds (Warm-up)
1.51535 seconds (Sampling)
3.31141 seconds (Total)
SAMPLING FOR MODEL 'fit_data3_ppc' NOW (CHAIN 2).
Gradient evaluation took 0.000289 seconds
1000 transitions using 10 leapfrog steps per transition would take 2.89 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 1.83882 seconds (Warm-up)
1.98939 seconds (Sampling)
3.82821 seconds (Total)
SAMPLING FOR MODEL 'fit_data3_ppc' NOW (CHAIN 3).
Gradient evaluation took 0.000295 seconds
1000 transitions using 10 leapfrog steps per transition would take 2.95 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 1.79217 seconds (Warm-up)
1.46049 seconds (Sampling)
3.25266 seconds (Total)
SAMPLING FOR MODEL 'fit_data3_ppc' NOW (CHAIN 4).
Gradient evaluation took 0.000287 seconds
1000 transitions using 10 leapfrog steps per transition would take 2.87 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 1.75639 seconds (Warm-up)
1.69095 seconds (Sampling)
3.44734 seconds (Total)
The diagnostics look good,
util$check_all_diagnostics(fit)[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"
and the marginal posterior distributions look reasonable.
params = extract(fit)
par(mfrow=c(2, 1))
hist(params$lambda, main="", xlab="lambda", yaxt='n', ylab="",
col=c_dark, border=c_dark_highlight)
hist(params$theta, main="", xlab="theta", yaxt='n', ylab="",
col=c_dark, border=c_dark_highlight)
Excited that we might finally have an adequate model, we proceed to the posterior predictive check.
B <- 30
obs_counts <- hist(input_data$y, breaks=(0:(B + 1))-0.5, plot=FALSE)$counts
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
pad_obs <- do.call(cbind, lapply(idx, function(n) obs_counts[n + 1]))
counts <- sapply(1:4000, function(n) hist(params$y_ppc[n,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Posterior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y",
ylim=c(0, max(c(obs_counts, cred[9,]))), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
lines(x, pad_obs, col="white", lty=1, lwd=2.5)
lines(x, pad_obs, col="black", lty=1, lwd=2)
Frustratingly we see that the tail of the posterior predictive histogram extends beyond that of the observed data. This suggests that the detector responses may be truncated, unable to record values above \(y = 14\). At this point we are suspicious of the exact measurement process but don't have enough domain expertise to suggest an expanded model. Before considering any expansions we need to consult with the one person who was there.
A principled analysis workflow is a powerful investigatory tool for identifying subtle structure in the observed data and motivating appropriate model features. There is a fine line, however, between improving the model and overfitting to the given observation. To avoid crossing this line we have to lean on our domain expertise, and sometimes the domain expertise of others.
Suspicious of the experimental setup under which the data were collected we ask to meet with the student who ran the experiment. When we bring up the odd behavior we saw the student seems nonchalant. "Oh, you mean like when the detectors didn't read out anything at all?" they say. "I just repeated the measurement for each detector until they reported an actual value."
And there it is. The readout system for the detectors seems to be vulnerable to overloading when the counts surpass a given value, returning no value at all. When the student repeated the measurement for detectors in this state they unintentionally induced a truncated observation where the Poisson distribution is cut off at a certain value.
The observation space does not change.
The same histogram summary remains appropriate.
We now have to augment our observational model to account for the truncation of the counts. Ideally we would treat the truncation threshold as a systematic parameter and infer it with uncertainties to avoid overfitting to our one observation. Unfortunately that model is a bit ungainly, so for pedagogical simplicity we will take the fixed threshold of \(y = 14\) implied by the observed data, assuming that it too is consistent with our hypothetical domain expertise.
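Truncating at \(U\) just renormalizes the Poisson probabilities over \(\{0, \ldots, U\}\), which is exactly what subtracting `poisson_lcdf(U | lambda)` accomplishes on the log scale in the Stan program below. A minimal numeric sketch, with illustrative values:

```python
from math import exp, factorial

def poisson_pmf(y, lam):
    return exp(-lam) * lam**y / factorial(y)

def truncated_poisson_pmf(y, lam, U):
    """Poisson renormalized by its CDF at the truncation point U."""
    norm = sum(poisson_pmf(k, lam) for k in range(U + 1))
    return poisson_pmf(y, lam) / norm

lam, U = 5.0, 14
total = sum(truncated_poisson_pmf(y, lam, U) for y in range(U + 1))
print(round(total, 9))  # 1.0: a proper distribution on {0, ..., U}
```

The rejection loop in the simulation program implements the same distribution generatively: draws above \(U\) are discarded and redrawn, mirroring the student repeating overloaded measurements.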
writeLines(readLines("sample_joint_ensemble4.stan"))
data {
int N;
}
transformed data {
int U = 14;
}
generated quantities {
// Simulate model configuration from prior model
real<lower=0, upper=1> theta = beta_rng(2.8663, 2.8663);
real<lower=0> lambda = inv_gamma_rng(3.48681, 9.21604);
// Simulate data from observational model
int y[N] = rep_array(0, N);
for (n in 1:N) {
if (!bernoulli_rng(theta)) {
y[n] = U + 1;
while (y[n] > U)
y[n] = poisson_rng(lambda);
}
}
}
writeLines(readLines("fit_data4.stan"))
data {
int N; // Number of observations
int y[N]; // Count at each observation
}
transformed data {
int U = 14; // Observational cutoff
}
parameters {
real<lower=0, upper=1> theta; // Excess zero probability
real<lower=0> lambda; // Poisson intensity
}
model {
// Prior model
theta ~ beta(2.8663, 2.8663);
lambda ~ inv_gamma(3.48681, 9.21604);
// Observational model that mixes a truncated Poisson with excess zeros
for (n in 1:N) {
real lpdf = poisson_lpmf(y[n] | lambda) - poisson_lcdf(U | lambda);
if (y[n] == 0)
target += log_mix(theta, 0, lpdf);
else
target += log(1 - theta) + lpdf;
}
}
Truncation already manifests in the histogram summary statistic so it remains sufficient to capture the relevant structure of the data.
We dutifully generate the joint ensemble of our expanded model.
R <- 1000
N <- 1000
simu_data <- list("N" = N)
fit <- stan(file='sample_joint_ensemble4.stan', data=simu_data,
iter=R, warmup=0, chains=1, refresh=R,
seed=4838282, algorithm="Fixed_param")
SAMPLING FOR MODEL 'sample_joint_ensemble4' NOW (CHAIN 1).
Iteration: 1 / 1000 [ 0%] (Sampling)
Iteration: 1000 / 1000 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
2.51975 seconds (Sampling)
2.51975 seconds (Total)
simu_lambdas <- extract(fit)$lambda
simu_thetas <- extract(fit)$theta
simu_ys <- extract(fit)$y
The prior predictive distribution of histograms now exhibits both zero-inflation and truncation.
B <- 26
counts <- sapply(1:R, function(r) hist(simu_ys[r,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Prior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y", ylim=c(0, max(cred[9,])), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
abline(v=25, col="white", lty=1, lwd=2.5)
abline(v=25, col="black", lty=1, lwd=2)
With the truncation we now have exactly zero prior predictive probability above the extreme observation scale derived from our domain expertise.
length(simu_ys[simu_ys > 25]) / length(simu_ys)
[1] 0
We continue on to fitting each replication in our joint ensemble.
tryCatch({
registerDoParallel(makeCluster(detectCores()))
simu_list <- t(data.matrix(data.frame(simu_lambdas, simu_thetas, simu_ys)))
# Compile the posterior fit model
fit_model = stan_model(file='fit_data4.stan')
ensemble_output <- foreach(simu=simu_list,
.combine='cbind') %dopar% {
simu_lambda <- simu[1]
simu_theta <- simu[2]
simu_y <- simu[3:(N + 2)];
# Fit the simulated observation
input_data <- list("N" = N, "y" = simu_y)
capture.output(library(rstan))
capture.output(fit <- sampling(fit_model, data=input_data, seed=4938483))
# Compute diagnostics
util <- new.env()
source('stan_utility.R', local=util)
warning_code <- util$check_all_diagnostics(fit, quiet=TRUE)
# Compute rank of prior draw with respect to thinned posterior draws
sbc_rank_lambda <- sum(simu_lambda < extract(fit)$lambda[seq(1, 4000 - 8, 8)])
sbc_rank_theta <- sum(simu_theta < extract(fit)$theta[seq(1, 4000 - 8, 8)])
# Compute posterior sensitivities
s <- summary(fit, probs = c(), pars='lambda')$summary
post_mean_lambda <- s[,1]
post_sd_lambda <- s[,3]
prior_sd_lambda <- sqrt( (9.21604)**2 / ((3.48681 - 1)**2 * (3.48681 - 2)) )  # inverse gamma sd
z_score_lambda <- (post_mean_lambda - simu_lambda) / post_sd_lambda
shrinkage_lambda <- 1 - (post_sd_lambda / prior_sd_lambda)**2
s <- summary(fit, probs = c(), pars='theta')$summary
post_mean_theta <- s[,1]
post_sd_theta <- s[,3]
prior_sd_theta <- sqrt( (2.8663)**2 / (4 * (2.8663)**2 * (2 * 2.8663 + 1)) )
z_score_theta <- (post_mean_theta - simu_theta) / post_sd_theta
shrinkage_theta <- 1 - (post_sd_theta / prior_sd_theta)**2
c(warning_code,
sbc_rank_lambda, z_score_lambda, shrinkage_lambda,
sbc_rank_theta, z_score_theta, shrinkage_theta)
}
}, finally={ stopImplicitCluster() })
Happily, the diagnostics all look clean.
warning_code <- ensemble_output[1,]
if (sum(warning_code) != 0) {
print("Some simulated posterior fits in the joint ensemble encountered problems!")
for (r in 1:R) {
if (warning_code[r] != 0) {
print(sprintf('Replication %s of %s', r, R))
util$parse_warning_code(warning_code[r])
print(sprintf('Simulated lambda = %s', simu_lambdas[r]))
print(sprintf('Simulated theta = %s', simu_thetas[r]))
print(" ")
}
}
} else {
print("No posterior fits in the joint ensemble encountered problems!")
}
[1] "No posterior fits in the joint ensemble encountered problems!"
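The `warning_code` returned by `check_all_diagnostics` packs all of the fit diagnostics into a single integer that `parse_warning_code` later unpacks. Because `stan_utility.R` is not shown here, the following is only a hypothetical sketch of how such a bit-flag encoding might work; the helpers `encode_warnings` and `decode_warnings` are illustrative names, not part of that script.

```r
# Hypothetical sketch of bit-flag warning codes; the actual
# stan_utility.R implementation may differ in detail.
encode_warnings <- function(flags) {
  # Pack a logical vector of diagnostic flags into one integer,
  # one bit per diagnostic
  sum(2^(seq_along(flags) - 1) * as.integer(flags))
}

decode_warnings <- function(code, n_flags) {
  # Unpack the integer back into individual logical flags
  bitwAnd(code, 2^(seq_len(n_flags) - 1)) > 0
}

# Example: first and third diagnostics flagged, second clean
code <- encode_warnings(c(TRUE, FALSE, TRUE))  # 5
decode_warnings(code, 3)                       # TRUE FALSE TRUE
```

A packed code like this is convenient in the parallel ensemble fits above because each worker can return a single number alongside its other summaries.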
Similarly the simulation-based calibration histograms exhibit no problematic behavior.
sbc_rank <- ensemble_output[2,]
sbc_hist <- hist(sbc_rank, seq(0, 500, 25) - 0.5,
                 col=c_dark, border=c_dark_highlight, plot=FALSE)
Warning in hist.default(sbc_rank, seq(0, 500, 25) - 0.5, col = c_dark, border =
c_dark_highlight, : arguments 'col', 'border' are not made use of
plot(sbc_hist, main="", xlab="Prior Rank (Lambda)", yaxt='n', ylab="")
low <- qbinom(0.005, R, 1 / 20)
mid <- qbinom(0.5, R, 1 / 20)
high <- qbinom(0.995, R, 1 / 20)
bar_x <- c(-10, 510, 500, 510, -10, 0, -10)
bar_y <- c(high, high, mid, low, low, mid, high)
polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=0, x1=500, y0=mid, y1=mid, col=c("#999999"), lwd=2)
plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)
sbc_rank <- ensemble_output[5,]
sbc_hist <- hist(sbc_rank, seq(0, 500, 25) - 0.5,
                 col=c_dark, border=c_dark_highlight, plot=FALSE)
Warning in hist.default(sbc_rank, seq(0, 500, 25) - 0.5, col = c_dark, border =
c_dark_highlight, : arguments 'col', 'border' are not made use of
plot(sbc_hist, main="", xlab="Prior Rank (Theta)", yaxt='n', ylab="")
mean <- length(sbc_rank) / (500 / 25)
low <- mean - 3 * sqrt(mean)
high <- mean + 3 * sqrt(mean)
bar_x <- c(-10, 510, 500, 510, -10, 0, -10)
bar_y <- c(high, high, mean, low, low, mean, high)
polygon(bar_x, bar_y, col=c("#DDDDDD"), border=NA)
segments(x0=0, x1=500, y0=mean, y1=mean, col=c("#999999"), lwd=2)
plot(sbc_hist, col=c_dark, border=c_dark_highlight, add=T)
Moreover, the recovered posteriors for each fit perform reasonably well.
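Before examining the scatter plots, it may help to recall what the two summaries computed in the ensemble fits mean: the posterior z-score measures how many posterior standard deviations the posterior mean falls from the true simulated value, while the posterior shrinkage measures how much the posterior contracts relative to the prior. A minimal sketch with invented numbers (none of these values come from the actual fits):

```r
# Made-up numbers purely for illustration
true_value <- 5.0   # parameter value drawn from the prior
post_mean  <- 5.2   # posterior mean recovered by the fit
post_sd    <- 0.4   # posterior standard deviation
prior_sd   <- 2.0   # prior standard deviation

# Distance of the posterior mean from the truth, in posterior sds;
# large magnitudes indicate a posterior concentrating away from the truth
z_score <- (post_mean - true_value) / post_sd      # 0.5

# Contraction of the posterior relative to the prior; values near 1
# indicate a highly informative likelihood, values near 0 a weak one
shrinkage <- 1 - (post_sd / prior_sd)^2            # 0.96
```

A well-behaved model should concentrate in the small-z-score, high-shrinkage region of these plots.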
z_score <- ensemble_output[3,]
shrinkage <- ensemble_output[4,]
plot(shrinkage, z_score, col=c("#8F272720"), lwd=2, pch=16, cex=0.8, main="Lambda",
     xlim=c(0, 1), xlab="Posterior Shrinkage", ylim=c(-5, 5), ylab="Posterior z-Score")
z_score <- ensemble_output[6,]
shrinkage <- ensemble_output[7,]
plot(shrinkage, z_score, col=c("#8F272720"), lwd=2, pch=16, cex=0.8, main="Theta",
     xlim=c(0, 1), xlab="Posterior Shrinkage", ylim=c(-5, 5), ylab="Posterior z-Score")
Exhausted, we march on to fit the observed data.
input_data <- read_rdump('workflow.data.R')
fit <- stan(file='fit_data4_ppc.stan', data=input_data,
seed=4938483, refresh=2000)
SAMPLING FOR MODEL 'fit_data4_ppc' NOW (CHAIN 1).
Gradient evaluation took 0.000974 seconds
1000 transitions using 10 leapfrog steps per transition would take 9.74 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 5.10714 seconds (Warm-up)
6.22867 seconds (Sampling)
11.3358 seconds (Total)
SAMPLING FOR MODEL 'fit_data4_ppc' NOW (CHAIN 2).
Gradient evaluation took 0.000764 seconds
1000 transitions using 10 leapfrog steps per transition would take 7.64 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 5.12871 seconds (Warm-up)
4.90172 seconds (Sampling)
10.0304 seconds (Total)
SAMPLING FOR MODEL 'fit_data4_ppc' NOW (CHAIN 3).
Gradient evaluation took 0.000748 seconds
1000 transitions using 10 leapfrog steps per transition would take 7.48 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 5.71073 seconds (Warm-up)
5.92014 seconds (Sampling)
11.6309 seconds (Total)
SAMPLING FOR MODEL 'fit_data4_ppc' NOW (CHAIN 4).
Gradient evaluation took 0.000757 seconds
1000 transitions using 10 leapfrog steps per transition would take 7.57 seconds.
Adjust your expectations accordingly!
Iteration: 1 / 2000 [ 0%] (Warmup)
Iteration: 1001 / 2000 [ 50%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 5.26315 seconds (Warm-up)
3.56016 seconds (Sampling)
8.82331 seconds (Total)
The diagnostics are clean,
util$check_all_diagnostics(fit)
[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-FMI indicated no pathological behavior"
and the marginal posteriors are reasonable.
params = extract(fit)
par(mfrow=c(2, 1))
hist(params$lambda, main="", xlab="lambda", yaxt='n', ylab="",
col=c_dark, border=c_dark_highlight)
hist(params$theta, main="", xlab="theta", yaxt='n', ylab="",
     col=c_dark, border=c_dark_highlight)
Anxiously we move on to the posterior predictive check.
B <- 20
obs_counts <- hist(input_data$y, breaks=(0:(B + 1))-0.5, plot=FALSE)$counts
idx <- rep(0:B, each=2)
x <- sapply(1:length(idx), function(b) if(b %% 2 == 0) idx[b] + 0.5 else idx[b] - 0.5)
pad_obs <- do.call(cbind, lapply(idx, function(n) obs_counts[n + 1]))
counts <- sapply(1:4000, function(n) hist(params$y_ppc[n,], breaks=(0:(B + 1))-0.5, plot=FALSE)$counts)
probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
cred <- sapply(1:(B + 1), function(b) quantile(counts[b,], probs=probs))
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n + 1]))
plot(1, type="n", main="Posterior Predictive Distribution",
xlim=c(-0.5, B + 0.5), xlab="y",
ylim=c(0, max(c(obs_counts, cred[9,]))), ylab="")
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
lines(x, pad_obs, col="white", lty=1, lwd=2.5)
lines(x, pad_obs, col="black", lty=1, lwd=2)
And finally we see no tension between the observed data and the posterior predictive distribution. After a few attempts we have a model that captures the intricacies of our observed data!
Satisfied with our detector detective work, we send the results out to our colleagues and start drafting a note instructing students on what to do when the detectors do not report any counts at all.
By very carefully following a principled Bayesian workflow we were able to answer the questions proposed in Section Two and develop a model that incorporates enough domain expertise to achieve a reasonable fit to our observed data. Indeed the progression of the workflow itself helped to inform which domain expertise we need to incorporate, and in particular when we needed to follow up with colleagues to better understand the circumstances of the experiment. By minding this domain expertise, and not simply fitting to the potentially irrelevant details of the observed data, we ensured a final model that is robust to overfitting and should generalize well to application on new observations.
That said, this workflow makes no claim that the final model is correct in any absolute sense. Our ability to critique the model is limited by the specific questions we ask and the information encoded in the observed data. Indeed more thorough questions or additional data might demonstrate limitations of our concluding model and the need for further iterations. Our detectors, for example, may not have identical responses, and those individual responses may not be so static in time. Similarly the source being measured may not emit particles uniformly in time or space. A principled workflow doesn’t yield the correct model but rather a model that meets the needs of a given inferential task.
While the workflow is structured, it is not automatic and requires careful and customized consideration of the available domain expertise in the scope of the modeling assumptions at each step. Moreover the workflow isn’t cheap, requiring significant computational resources to fully exploit it even on this relatively simple example. That investment of time and resources, however, provides a powerful model robust to many of the pathologies that can lead to biased inferences and faulty scientific insights.
I thank Sean Talts, Marc Dotson, and Tomi Peltola for helpful comments as well as the feedback from those who have attended my courses and talks.
A very special thanks to everyone supporting me on Patreon: Aki Vehtari, Austin Rochford, Bo Schwartz Madsen, Cat Shark, Charles Naylor, Colin Carroll, Daniel Simpson, David Pascall, David Roher, Elad Berkman, Finn Lindgren, Granville Matheson, Hernan Bruno, J Michael Burgess, Joel Kronander, Jonas Beltoft Gehrlein, Joshua Mayer, Justin Bois, Lars Barquist, Marek Kwiatkowski, Maurits van der Meer, Maxim Kesin, Michael Dillon, Ole Rogeberg, Oscar Olvera, Riccardo Fusaroli, Richard Torkar, Sam Petulla, Sam Zorowitz, Seo-young Kim, Seth Axen, Simon Duane, Stephen Oates, Stijn, and Vladislavs Dovgalecs.
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
devtools::session_info("rstan")
Session info -------------------------------------------------------------
setting value
version R version 3.4.2 (2017-09-28)
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2018-10-15
Packages -----------------------------------------------------------------
package * version date source
BH 1.65.0-1 2017-08-24 CRAN (R 3.4.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
dichromat 2.0-0 2013-01-24 CRAN (R 3.4.0)
digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
graphics * 3.4.2 2017-10-04 local
grDevices * 3.4.2 2017-10-04 local
grid 3.4.2 2017-10-04 local
gridExtra 2.3 2017-09-09 CRAN (R 3.4.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
inline 0.3.14 2015-04-13 CRAN (R 3.4.0)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.4.2)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
MASS 7.3-47 2017-02-26 CRAN (R 3.4.2)
Matrix 1.2-11 2017-08-21 CRAN (R 3.4.2)
methods 3.4.2 2017-10-04 local
munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0)
Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2)
RcppEigen 0.3.3.3.0 2017-05-01 CRAN (R 3.4.0)
reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
rlang 0.1.4 2017-11-05 CRAN (R 3.4.2)
rstan * 2.17.3 2018-01-20 CRAN (R 3.4.2)
scales 0.5.0 2017-08-24 CRAN (R 3.4.1)
StanHeaders * 2.17.2 2018-01-20 CRAN (R 3.4.2)
stats * 3.4.2 2017-10-04 local
stats4 3.4.2 2017-10-04 local
stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
tibble 1.3.4 2017-08-22 CRAN (R 3.4.1)
tools 3.4.2 2017-10-04 local
utils * 3.4.2 2017-10-04 local
viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.0)
Betancourt, Michael. 2015. “A Unified Treatment of Predictive Model Comparison,” June.
———. 2018. “Calibrating Model-Based Inferences and Decisions,” March.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2017. “Visualization in Bayesian Workflow,” September.
Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. “The Prior Can Generally Only Be Understood in the Context of the Likelihood,” August.
Good, I.J. 1950. Probability and the Weighing of Evidence. New York: Hafners.
Rubin, Donald B. 1984. “Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician.” Ann. Statist. 12 (4): 1151–72.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G. Martins, and Sigrunn H. Sørbye. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statist. Sci. 32 (1): 1–28.
Talts, Sean, Michael Betancourt, Daniel Simpson, Aki Vehtari, and Andrew Gelman. 2018. “Validating Bayesian Inference Algorithms with Simulation-Based Calibration,” April.